to update his state of mind after each time step. We identify two ways of
predicting by MDL for this setup, namely a static and a dynamic one. (A third 
variant, hybrid MDL, will turn out inferior.) We will prove that under the only 
assumption that the data is generated by a distribution contained in the model 
class, the MDL predictions converge to the true values almost surely. This is 
accomplished by proving finite bounds on the quadratic, the Hellinger, and 
the Kullback-Leibler loss of the MDL learner, which are however exponentially 
worse than for Bayesian prediction. We demonstrate that these bounds are 
sharp, even for model classes containing only Bernoulli distributions. We 
show how these bounds imply regret bounds for arbitrary loss functions. Our 
results apply to a wide range of setups, namely sequence prediction, pattern 
classification, regression, and universal induction in the sense of Algorithmic 
Information Theory among others. 
1 Introduction



"Always prefer the simplest explanation for your observation," says Occam's razor. 
In Learning and Information Theory, simplicity is often quantified in terms of
description length, giving immediate rise to the Minimum Description Length (MDL)
principle [WB68, Ris78, Grü98]. Thus MDL can be seen as a strategy against
overfitting. An alternative way to think of MDL is Bayesian. The explanations for the
observations (the models) are endowed with a prior. Then the model having maximum
a posteriori (MAP) probability is also a two-part MDL estimate, where the
correspondence between probabilities and description lengths is simply given by a
negative logarithm.

How does two-part MDL perform for prediction? Some very accurate answers
to this question have already been given. If the data is generated by an independently
identically distributed (i.i.d.) process, then the MDL estimates are consistent
[BC91]. In this case, an important quantity to consider is the index of resolvability,
which depends on the complexity of the data generating process. This quantity is a
tight bound on the regret in terms of coding (i.e. the excess code length). Moreover,
the index of resolvability also bounds the predictive regret, namely the rate of
convergence of the predictive distribution to the true one. These results apply to both
discrete and continuously parameterized model classes, where in the latter case the
MDL estimator must be discretized with an appropriate precision.

Under the relaxed assumption that the data generating process obeys a central
limit theorem and some additional conditions, Rissanen [Ris96, BRY98] proves an
asymptotic bound on the regret of MDL codes. Here, he also removes the coding
redundancy arising if two-part codes are defined in the straightforward way. The
resulting bound is very similar to that in [CB90] for Bayes mixture codes and i.i.d.
processes, where the i.i.d. assumption may also be relaxed [Hut03b]. Other similar
and related results can be found in [GV01, GV04].

In this work, we develop new methods in order to arrive at very general consistency
theorems for MDL on countable model classes. Our setup is online sequence
prediction, that is, the symbols $x_1, x_2, \ldots$ of an infinite sequence are revealed
successively by the environment, where our task is to predict the next symbol in each
time step. Consistency is established by proving finite cumulative bounds on the
differences of the predictive to the true distribution. Differences will be measured
in terms of the relative entropy, the quadratic distance, and the Hellinger distance.
Most of our results are based on the only assumption that the data generating
process is contained in the model class. (The discussion of how strong this assumption
is will be postponed to the last section.) Our results imply regret bounds with
arbitrary loss functions. Moreover, they can be directly applied to important general
setups such as pattern classification, regression, and universal induction.

As many scientific models (e.g. in physics or biology) are smooth, much statistical
work is focussed on continuous model classes. On the other hand, the largest relevant
classes from a computational point of view are at most countable. In particular,
the field of Algorithmic Information Theory (also known as Kolmogorov Complexity,
e.g. [ZL70, LV97, Cal02, Hut04]) studies the class of all lower-semicomputable
semimeasures. Then there is a one-to-one correspondence of models and programs
on a fixed universal Turing Machine. (Since programs need not halt on each input,
models are semimeasures instead of measures, see e.g. [LV97] for details.) This model
class can be considered the largest one which can be processed in the limit under
standard computational restrictions. We will develop all our results for semimeasures,
so that they can be applied in this context, which we refer to as universal
sequence prediction.

In the universal setup, the Bayes mixture is also termed Solomonoff-Levin prior
and has been intensely studied first by Solomonoff [Sol64, Sol78]. Its predictive
properties are excellent [Hut01, Hut04]. Precisely, one can bound the cumulative loss
by the complexity of the data generating process. This is the reference performance
we compare MDL to. It turns out that the predictive properties of MDL can be
exponentially worse, even in the case that the model class contains only Bernoulli
distributions. Another related quantity in the universal setup is one-part MDL,
which has been studied in [Hut03c]. We will briefly encounter it in Section 8.

The paper is structured as follows. Section 2 establishes basic definitions. In
Section 3 we introduce the MDL estimator and show how it can be used for sequence
prediction in at least three ways. Sections 4 and 5 are devoted to convergence
theorems. In Sections 6 and 7 we study the stabilization properties of the MDL
estimator. Section 8 presents more general loss bounds as well as three important
applications: pattern classification, regression, and universal induction. Finally,
Section 9 contains the conclusions.



2 Prerequisites and Notation 



We build on the notation of [LV97] and [Hut04]. Let the alphabet $\mathcal{X}$ be a
finite set of symbols. We consider the spaces $\mathcal{X}^*$ and $\mathcal{X}^\infty$
of finite strings and infinite sequences over $\mathcal{X}$. The initial part of a
sequence up to a time $t \in \mathbb{N}$ or $t-1 \in \mathbb{N}$ is denoted by
$x_{1:t}$ or $x_{<t}$, respectively. The empty string is denoted by $\epsilon$.
A semimeasure is a function $\nu : \mathcal{X}^* \to [0,1]$ such that

$$\nu(\epsilon) \le 1 \quad\text{and}\quad \nu(x) \ge \sum_{a\in\mathcal{X}} \nu(xa) \quad\text{for all } x \in \mathcal{X}^* \tag{1}$$

holds. If equality holds in both inequalities of (1), then we have a measure.
Intuitively, the quantity $\nu(x)$ can be understood as the probability that a data
generating process yields a string starting with $x$. Then, for a measure, the
probabilities of all joint continuations of $x$ add up to $\nu(x)$, while for a
semimeasure, there may be a "probability leak". Recall that we are interested in
semimeasures (and not only in measures) because of their correspondence to programs
on a fixed universal Turing machine in the universal setup and our inability to decide
the halting problem.






Let $\mathcal{C}$ be a countable class of (semi)measures, i.e.
$\mathcal{C} = \{\nu_i : i \in I\}$ with finite or infinite index set
$I \subset \mathbb{N}$. A (semi)measure $\tilde\nu$ dominates the class $\mathcal{C}$
iff for every $\nu_i \in \mathcal{C}$ there is a constant $c_i > 0$ such that
$\tilde\nu(x) \ge c_i \cdot \nu_i(x)$ holds for all $x \in \mathcal{X}^*$. A
dominant semimeasure $\tilde\nu$ need not be contained in $\mathcal{C}$.

Each (semi)measure $\nu \in \mathcal{C}$ is associated with a weight $w_\nu > 0$, and
we require $\sum_\nu w_\nu \le 1$. We may interpret the weights as a prior on
$\mathcal{C}$. Then it is obvious that the Bayes mixture

$$\xi(x) = \xi_{[\mathcal{C}]}(x) := \sum_{\nu\in\mathcal{C}} w_\nu \nu(x) \quad (\text{for } x \in \mathcal{X}^*) \tag{2}$$

dominates $\mathcal{C}$. Assume that there is some measure $\mu \in \mathcal{C}$, the
true distribution, generating sequences $x_{<\infty} \in \mathcal{X}^\infty$.
Typically $\mu$ is unknown. (Note that we require $\mu$ to be a measure: The data
stream always continues, there are no "probability leaks".) If some initial part
$x_{<t}$ of a sequence is given, the probability of observing $x_t \in \mathcal{X}$
as a next symbol is

$$\mu(x_t|x_{<t}) = \frac{\mu(x_{1:t})}{\mu(x_{<t})} \quad\text{if } \mu(x_{<t}) > 0, \tag{3}$$

and, for well-definedness, $\mu(x_t|x_{<t}) = 0$ if $\mu(x_{<t}) = 0$ (this case has
probability zero). Note that $\mu(x_t|x_{<t})$ can depend on the complete history
$x_{<t}$. We may generally define the quantity (3) for any function
$\varphi : \mathcal{X}^* \to [0,1]$; we call
$\varphi(x_t|x_{<t}) := \varphi(x_{1:t})/\varphi(x_{<t})$ the $\varphi$-prediction.
Clearly, this is not necessarily a probability on $\mathcal{X}$ for general $\varphi$.
For a semimeasure $\nu$ in particular, the $\nu$-prediction $\nu(\cdot|x_{<t})$ is a
semimeasure on $\mathcal{X}$.

We define the expectation with respect to the true probability $\mu$: Let $n \ge 0$
and $f : \mathcal{X}^n \to \mathbb{R}$ be a function, then

$$\mathbf{E} f = \mathbf{E} f(x_{1:n}) = \sum_{x_{1:n}\in\mathcal{X}^n} \mu(x_{1:n}) f(x_{1:n}). \tag{4}$$

More generally, the expectation may be defined as an integral over infinite sequences.
But since we won't need it, we can keep things simple. The following is a central
result about prediction with Bayes mixtures in a form independent of Algorithmic
Information Theory.

Theorem 1 For any class of (semi)measures $\mathcal{C}$ containing the true
distribution $\mu$, which is a measure, we have

$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\mu(a|x_{<t}) - \xi(a|x_{<t})\big)^2 \le \ln w_\mu^{-1}. \tag{5}$$

This was found by Solomonoff ([Sol78]) for universal sequence prediction. A proof
is also given in [LV97] (only for binary alphabet) or [Hut04] (arbitrary alphabet).
It is surprisingly simple once Lemma 6 is known. A few lines analogous to (14) and
(15) exploiting the dominance of $\xi$ are sufficient.
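
The bound of Theorem 1 can be checked numerically on a toy problem. The following is a minimal sketch, assuming a hypothetical class of three Bernoulli measures with illustrative weights (`THETAS`, `WEIGHTS` and `TRUE_INDEX` are assumptions made here, not taken from the paper). It evaluates the cumulative expected square error, truncated at a finite horizon, by exact enumeration of all binary strings and compares it with $\ln w_\mu^{-1}$.

```python
import itertools
import math

# Hypothetical class: Bernoulli measures with parameter theta = P(symbol 1).
THETAS = [0.3, 0.5, 0.9]
WEIGHTS = [0.2, 0.5, 0.3]   # illustrative prior weights w_nu, summing to one
TRUE_INDEX = 0              # assumption: mu is the first element of the class


def nu(theta, x):
    """Probability that a Bernoulli(theta) source emits the binary tuple x."""
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)


def xi(x):
    """Bayes mixture: weighted sum of all class members evaluated at x."""
    return sum(w * nu(th, x) for th, w in zip(THETAS, WEIGHTS))


def cumulative_error(horizon):
    """Sum over t <= horizon of E sum_a (mu(a|x_<t) - xi(a|x_<t))^2."""
    mu_theta = THETAS[TRUE_INDEX]
    total = 0.0
    for t in range(1, horizon + 1):
        for x in itertools.product((0, 1), repeat=t - 1):
            xi_one = xi(x + (1,)) / xi(x)
            # Binary alphabet and measures: both squared differences coincide.
            total += nu(mu_theta, x) * 2.0 * (mu_theta - xi_one) ** 2
    return total


bound = math.log(1.0 / WEIGHTS[TRUE_INDEX])   # ln w_mu^{-1}
error = cumulative_error(12)
print(error, "<=", bound)
```

Since all terms are non-negative, the truncated sum can only underestimate the infinite one, so the comparison is consistent with the theorem for any horizon.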
One should be aware that the condition $\mu \in \mathcal{C}$ is essential in general,
for both Bayes and MDL predictions [GL04]. On the other hand, one can show that if
there is an element in $\mathcal{C}$ which is sufficiently close to $\mu$ in an
appropriate sense, then still good predictive properties hold [Hut03b].

Note that although $w_\nu$ can be interpreted as a prior on the model class, we do not
assume any probabilistic structure for $\mathcal{C}$ (e.g. a sampling mechanism). The
theorem rather states that the cumulative loss is bounded by a quantity depending on
the complexity $\ln w_\mu^{-1}$ of the true distribution. The same kind of assertion
will be proven for MDL predictions later.

The bound (5) implies that the $\xi$-predictions converge to the $\mu$-predictions
almost surely (i.e. with $\mu$-probability one). This is not hard to see, since with
the abbreviation $s_t = \sum_a (\mu(a|x_{<t}) - \xi(a|x_{<t}))^2$ and for each
$\epsilon > 0$, we have

$$\mathbf{P}(\exists\, t \ge n : s_t \ge \epsilon) = \mathbf{P}\Big(\bigcup_{t\ge n} \{s_t \ge \epsilon\}\Big) \le \sum_{t\ge n} \mathbf{P}(s_t \ge \epsilon) \le \frac{1}{\epsilon} \sum_{t=n}^{\infty} \mathbf{E}\, s_t \xrightarrow{n\to\infty} 0. \tag{6}$$

Actually, (5) yields an even stronger assertion, since it characterizes the speed of
convergence by the quantity on the right hand side. Precisely, the expected number of
times $t$ in which $\xi(a|x_{<t})$ deviates by more than $\epsilon$ from
$\mu(a|x_{<t})$ is finite and bounded by $\ln w_\mu^{-1}/\epsilon^2$, and the
probability that the number of $\epsilon$-deviations exceeds
$\frac{\ln w_\mu^{-1}}{\epsilon^2\delta}$ is smaller than $\delta$. (However, we cannot
conclude a convergence rate of $s_t = o(\frac{1}{t})$ from (5), since the quadratic
differences generally do not decrease monotonically.)
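
The two deviation bounds just stated follow from Markov's inequality. As a short worked derivation (a sketch only, where $D_\epsilon$ is a symbol introduced here for the number of time steps $t$ with $s_t > \epsilon^2$, which includes every step at which some $|\xi(a|x_{<t}) - \mu(a|x_{<t})|$ exceeds $\epsilon$, and where $w_\mu < 1$ is assumed so that $\ln w_\mu^{-1} > 0$):

```latex
\mathbf{E}\, D_\epsilon
  = \sum_{t=1}^{\infty} \mathbf{P}\big(s_t > \epsilon^2\big)
  \le \frac{1}{\epsilon^2} \sum_{t=1}^{\infty} \mathbf{E}\, s_t
  \le \frac{\ln w_\mu^{-1}}{\epsilon^2},
\qquad
\mathbf{P}\!\left( D_\epsilon \ge \frac{\ln w_\mu^{-1}}{\epsilon^2 \delta} \right)
  \le \frac{\epsilon^2 \delta}{\ln w_\mu^{-1}}\; \mathbf{E}\, D_\epsilon
  \le \delta .
```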

Since we will encounter this type of convergence (5) frequently in the following,
we call it convergence in mean sum (i.m.s.):

$$\varphi \xrightarrow{\text{i.m.s.}} \mu \iff \exists\, C > 0 : \sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\mu(a|x_{<t}) - \varphi(a|x_{<t})\big)^2 \le C < \infty. \tag{7}$$

Then Theorem 1 states that the $\xi$-predictions converge to the $\mu$-predictions
i.m.s., or "$\xi$ converges to $\mu$ i.m.s." for short. By (6), convergence i.m.s.
implies almost sure convergence (with respect to the true distribution $\mu$). Note
that in contrast, convergence in the mean, i.e.
$\mathbf{E}\big[\sum_a (\mu(a|x_{<t}) - \varphi(a|x_{<t}))^2\big] \to 0$ as
$t \to \infty$, only implies convergence in probability.

Probabilities vs. Description Lengths. By the Kraft inequality, each (semi)measure
can be associated with a code length or complexity by means of the negative
logarithm, where all (binary) codewords form a prefix-free set. The converse holds
as well. We introduce the abbreviation

$$K\ldots = -\log_2 \ldots, \quad\text{e.g. } K\nu(x) = -\log_2 \nu(x) \tag{8}$$

for a semimeasure $\nu$ and $x \in \mathcal{X}^*$ and $K\xi(x) = -\log_2 \xi(x)$ for
the Bayes mixture $\xi$. It is common to ignore the somewhat irrelevant restriction
that code lengths must be integers. In particular, given a class of semimeasures
$\mathcal{C}$ together with weights, each weight $w_\nu$ corresponds to a description
length or complexity

$$Kw(\nu) = -\log_2 w_\nu. \tag{9}$$

It is often only a matter of notational convenience whether description lengths or
probabilities are used, but description lengths are generally preferred in Algorithmic
Information Theory. Keeping the equivalence in mind, we will develop the general
theory in terms of probabilities, but formulate parts of the results in universal
sequence prediction rather in terms of complexities.



3 MDL Estimator and Predictions 

Assume that $\mathcal{C}$ is a countable class of semimeasures together with weights
$(w_\nu)_{\nu\in\mathcal{C}}$, and $x \in \mathcal{X}^*$ is some string. Then the
maximizing element $\nu^x$, often called MAP (maximum a posteriori) estimator, is
defined as

$$\nu^x = \nu^x_{[\mathcal{C}]} = \arg\max_{\nu\in\mathcal{C}} \{w_\nu \nu(x)\}. \tag{10}$$

In case of a tie, we need not specify the further choice at this point, just pick any of
the maximizing elements. But for concreteness, you may think that ties are broken in
favor of larger prior weights. The maximum is always attained in (10) since for each
$\epsilon > 0$ at most a finite number of elements fulfil $w_\nu \nu(x) > \epsilon$.
Observe immediately the correspondence in terms of description lengths rather than
probabilities:

$$\nu^x = \arg\min_{\nu\in\mathcal{C}} \{Kw(\nu) + K\nu(x)\}.$$

Then the minimum description length principle is obvious: $\nu^x$ minimizes the joint
description length of the model plus the data given the model¹ (see (8) and (9)). As
explained before, we stick to the product notation.
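
The equivalence between the MAP form and the description-length form of the estimator can be checked numerically. The following is a minimal sketch, assuming a hypothetical three-element Bernoulli class with illustrative weights; the names `likelihood`, `map_index` and `mdl_index` are introduced here for illustration only.

```python
import math

# Hypothetical model class: Bernoulli parameters theta = P(symbol 1).
THETAS = [0.25, 0.5, 0.75]
WEIGHTS = [0.5, 0.25, 0.25]   # illustrative prior weights


def likelihood(theta, x):
    """nu_theta(x) for a binary string x given as a list of 0/1 symbols."""
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)


def map_index(x):
    """Index of the class element maximizing w_nu * nu(x)."""
    return max(range(len(THETAS)),
               key=lambda i: WEIGHTS[i] * likelihood(THETAS[i], x))


def mdl_index(x):
    """Index minimizing the two-part code length Kw(nu) + K nu(x) in bits."""
    def code_length(i):
        model_bits = -math.log2(WEIGHTS[i])
        data_bits = -math.log2(likelihood(THETAS[i], x))
        return model_bits + data_bits
    return min(range(len(THETAS)), key=code_length)


x = [1, 1, 0, 1, 1, 1]
print(map_index(x), mdl_index(x))
```

Both functions select the same element because the negative logarithm is strictly decreasing, so maximizing the product is the same as minimizing the sum of code lengths.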

For notational simplicity, let $\nu^*(x) = \nu^x(x)$. The two-part MDL estimator is
defined by

$$\varrho(x) = \varrho_{[\mathcal{C}]}(x) = w_{\nu^x} \nu^x(x) = \max_{\nu\in\mathcal{C}} \{w_\nu \nu(x)\}.$$

So $\varrho$ chooses the maximizing element with respect to its argument. We may also
use the version $\varrho^y(x) := w_{\nu^y} \nu^y(x)$ for which the choice depends on
the superscript instead of the argument. Note that the use of the term "estimator" is
non-standard,

¹ The term MAP estimator is more precise. For two reasons, our definition might not be
considered as MDL in the strict sense. First, MDL is often associated with a specific
prior, while we admit arbitrary priors (compare the discussion section at the end of
this paper). Second, when coding some data $x$, one can exploit the fact that once the
distribution $\nu^x$ is specified, only data which leads to this $\nu^x$ needs to be
considered. This allows for a description shorter than $Kw(\nu^x)$. Nevertheless, the
construction principle is commonly termed MDL, compare e.g. the "ideal MDL" in [VL00].
since $\varrho$ is a product of the estimator $\nu^*$ (this use is standard) and its
prior weight. There will be no confusion between these two meanings of "estimator" in
the following.

For each $x, y \in \mathcal{X}^*$,

$$\xi(x) \ge \varrho(x) \ge \varrho^y(x) \tag{11}$$

is immediate. If $\mathcal{C}$ contains only measures, we have
$\sum_a \varrho(xa) \ge \sum_a \varrho^x(xa) = \varrho^x(x) = \varrho(x)$ for all
$x \in \mathcal{X}^*$, so $\varrho$ has some "anti-semimeasure" property. If
$\mathcal{C}$ contains semimeasures, no semimeasure or anti-semimeasure property can
be established for $\varrho$.

We can define MDL predictors according to (3). There are basically two possible
ways to use MDL for prediction.

Definition 2 The dynamic MDL predictor is defined as

$$\varrho(a|x) = \frac{\varrho(xa)}{\varrho(x)} = \frac{\varrho^{xa}(xa)}{\varrho^x(x)}.$$

That is, we look for a short description of $xa$ and relate it to a short description of
$x = x_{<t}$. We call this dynamic since for each possible $a$ we have to find a new
MDL estimator. This is the closest correspondence to the Bayes mixture $\xi$-predictor.

Definition 3 The static MDL predictor is given by

$$\varrho^{\mathrm{static}}(a|x) = \varrho^x(a|x) = \frac{\varrho^x(xa)}{\varrho(x)} = \frac{\varrho^x(xa)}{\varrho^x(x)} = \frac{\nu^x(xa)}{\nu^x(x)}.$$

Here obviously only one MDL estimator $\varrho^x$ has to be identified. This is usually
more efficient in practice.

We will define another MDL predictor, the hybrid one, in a later section. It can be
paraphrased as "do dynamic MDL but drop weights". We will see that its predictive
performance is weaker.

The range of the static MDL predictor is obviously contained in $[0,1]$. For the
dynamic MDL predictor, this holds by

$$\varrho(x) \ge \varrho^{xa}(x) \ge \varrho^{xa}(xa). \tag{12}$$
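
The difference between the two predictors is easy to see on a small example. The following is a minimal sketch, assuming a hypothetical class of three Bernoulli measures with illustrative weights; the function names are introduced here and are not part of the paper.

```python
# Hypothetical class of three Bernoulli measures, theta = P(symbol 1).
THETAS = [0.2, 0.5, 0.8]
WEIGHTS = [0.6, 0.2, 0.2]   # illustrative prior weights


def nu(theta, x):
    """nu_theta(x) for a binary string x given as a list of 0/1 symbols."""
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)


def map_theta(x):
    """Parameter of the MAP element for x; ties go to the larger weight."""
    best = max(zip(THETAS, WEIGHTS),
               key=lambda tw: (tw[1] * nu(tw[0], x), tw[1]))
    return best[0]


def rho(x):
    """Two-part MDL estimator: the maximum of w_nu * nu(x) over the class."""
    return max(w * nu(th, x) for th, w in zip(THETAS, WEIGHTS))


def dynamic_mdl(a, x):
    """Dynamic predictor rho(xa) / rho(x): a new MAP element for each a."""
    return rho(x + [a]) / rho(x)


def static_mdl(a, x):
    """Static predictor: the single MAP element for x predicts every a."""
    theta = map_theta(x)
    return nu(theta, x + [a]) / nu(theta, x)


history = [1]
print([static_mdl(a, history) for a in (0, 1)])    # sums to one
print([dynamic_mdl(a, history) for a in (0, 1)])   # may sum to more than one
```

In this toy case the static predictions form a probability distribution, while the dynamic predictions each lie in $[0,1]$ but add up to more than one, which is why a normalization step is needed.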

Static MDL is omnipresent in machine learning and applications, see also Section 8.
In fact, many common prediction algorithms can be abstractly understood
as static MDL, or rather as approximations. Namely, if a prediction task is
accomplished by building a model such as a neural network with a suitable regularization²
to prevent "overfitting", this is just searching an MDL estimator within a certain
class of distributions. After that, only this model is used for prediction. Dynamic



² There are, however, regularization methods which cannot be interpreted in this way
but build on a different theoretical foundation, such as structural risk minimization.
MDL is applied more rarely due to its larger computational effort. For example,
the similarity metric proposed in [LCL+03] can be interpreted as (a deterministic
variant of) dynamic MDL.

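We will need to convert our MDL predictors to measures by means of normalization.
If $\varphi : \mathcal{X}^* \to [0,1]$ is any function, then

$$\varphi_{\mathrm{norm}}(a|x) := \frac{\varphi(a|x)}{\sum_{b\in\mathcal{X}} \varphi(b|x)} = \frac{\varphi(xa)}{\sum_{b\in\mathcal{X}} \varphi(xb)}$$

is a measure (assume that the denominator is different from zero, which is always
true with probability 1 (w.p.1) if $\varphi$ is an MDL predictor). This procedure is
known as Solomonoff normalization [Sol78, LV97] and results in

$$\varphi_{\mathrm{norm}}(x_{1:n}) = \prod_{t=1}^{n} \frac{\varphi(x_{1:t})}{\sum_{a\in\mathcal{X}} \varphi(x_{<t}a)} = \frac{\varphi(x_{1:n})}{\varphi(\epsilon)\, N_\varphi(x_{1:n})},$$

where

$$N_\varphi(x_{1:n}) = \prod_{t=1}^{n} \frac{\sum_{a\in\mathcal{X}} \varphi(x_{<t}a)}{\varphi(x_{<t})} \tag{13}$$

is the normalizer.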
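
Normalization can be sketched in a few lines. The following is a minimal illustration, assuming a hypothetical two-element Bernoulli class for the unnormalized function `rho`; all function names are introduced here. The final check uses the identity that the product of normalized predictions along $x$ equals $\varphi(x)$ divided by $\varphi(\epsilon)$ times the product of the per-step sums $\sum_a \varphi(x_{<t}a)/\varphi(x_{<t})$.

```python
# Hypothetical two-element Bernoulli class, theta = P(symbol 1).
THETAS = [0.2, 0.8]
WEIGHTS = [0.5, 0.5]   # illustrative prior weights


def rho(x):
    """Unnormalized two-part MDL estimator on binary tuples."""
    ones = sum(x)
    zeros = len(x) - ones
    return max(w * th ** ones * (1.0 - th) ** zeros
               for th, w in zip(THETAS, WEIGHTS))


def normalized_prediction(phi, x, alphabet=(0, 1)):
    """Return the normalized next-symbol distribution of phi after x."""
    total = sum(phi(x + (b,)) for b in alphabet)
    return {a: phi(x + (a,)) / total for a in alphabet}


def normalizer(phi, x, alphabet=(0, 1)):
    """Product over t of sum_a phi(x_<t a) / phi(x_<t)."""
    n = 1.0
    for t in range(len(x)):
        prefix = x[:t]
        n *= sum(phi(prefix + (a,)) for a in alphabet) / phi(prefix)
    return n


def normalized_measure(phi, x, alphabet=(0, 1)):
    """Probability of x under the measure built from normalized predictions."""
    p = 1.0
    for t in range(len(x)):
        p *= normalized_prediction(phi, x[:t], alphabet)[x[t]]
    return p


x = (1, 0, 1, 1)
lhs = normalized_measure(rho, x)
rhs = rho(x) / (rho(()) * normalizer(rho, x))
print(lhs, rhs)
```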

We conclude this section with a simple example. 

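Bernoulli and i.i.d. classes. Let $n \in \mathbb{N}$, $\mathcal{X} = \{1, \ldots, n\}$,
and

$$\mathcal{C} = \{\nu_\vartheta(x_{1:t}) = \vartheta_{x_1} \cdot \ldots \cdot \vartheta_{x_t} \mid \vartheta \in \Theta\} \quad\text{with}\quad \Theta = \Big\{\vartheta \in ([0,1] \cap \mathbb{Q})^n : \sum_{i=1}^{n} \vartheta_i = 1\Big\}$$

be the set of all rational probability vectors with any prior
$(w_\vartheta)_{\vartheta\in\Theta}$. Each $\vartheta \in \Theta$ generates sequences
$x_{<\infty}$ of independently identically distributed (i.i.d.) random variables such
that $\mathbf{P}(x_t = i) = \vartheta_i$ for all $t \ge 1$ and $1 \le i \le n$. If
$x_{1:t}$ is the initial part of a sequence and $\alpha \in \Theta$ is defined by
$\alpha_i = \frac{1}{t} |\{s \le t : x_s = i\}|$, then it is easy to see that

$$\nu^{x_{1:t}} = \arg\min_{\vartheta\in\Theta} \{Kw(\vartheta) \cdot \ln 2 + t \cdot D(\alpha\|\vartheta)\},$$

where $D(\alpha\|\vartheta) := \sum_{i=1}^{n} \alpha_i \ln\frac{\alpha_i}{\vartheta_i}$
is the Kullback-Leibler divergence. If $|\mathcal{X}| = 2$, then $\Theta$ is also
called a Bernoulli class, and one usually takes the binary alphabet
$\mathcal{X} = \{0, 1\}$ in this case.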
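
The divergence form of the estimator for i.i.d. classes can be compared against the direct MAP computation. The following is a minimal sketch, assuming a hypothetical finite subset of the class over a three-symbol alphabet with illustrative weights; `kl`, `mdl_index` and `map_index` are names introduced here.

```python
import math


def kl(alpha, theta):
    """Kullback-Leibler divergence D(alpha || theta) in nats."""
    total = 0.0
    for a, th in zip(alpha, theta):
        if a > 0.0:
            if th == 0.0:
                return math.inf
            total += a * math.log(a / th)
    return total


def mdl_index(x, thetas, weights, n):
    """Minimize Kw(theta) * ln 2 + t * D(alpha || theta) over a finite class."""
    t = len(x)
    # Empirical frequencies of the symbols 1..n in x.
    alpha = [sum(1 for s in x if s == i) / t for i in range(1, n + 1)]

    def objective(j):
        # Kw(theta) * ln 2 equals -ln w_theta.
        return -math.log(weights[j]) + t * kl(alpha, thetas[j])

    return min(range(len(thetas)), key=objective)


def map_index(x, thetas, weights):
    """Direct maximization of w_theta * nu_theta(x), for comparison."""
    def score(j):
        p = weights[j]
        for s in x:
            p *= thetas[j][s - 1]
        return p
    return max(range(len(thetas)), key=score)


# Hypothetical finite class over the alphabet {1, 2, 3}.
thetas = [(1 / 3, 1 / 3, 1 / 3), (0.5, 0.25, 0.25), (0.1, 0.1, 0.8)]
weights = [0.5, 0.25, 0.25]
x = [3, 3, 3, 1, 3, 3]
print(mdl_index(x, thetas, weights, 3), map_index(x, thetas, weights))
```

The two objectives differ only by the term $t$ times the entropy of $\alpha$, which does not depend on $\vartheta$, so they select the same element.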



4 Dynamic MDL 

We may now develop convergence results, beginning with the dynamic MDL predictor
from Definition 2. The following simple lemma is crucial for all subsequent
proofs.
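Lemma 4 For an arbitrary class of (semi)measures $\mathcal{C}$, we have

$$\text{(i)}\quad \varrho(x) - \sum_{a\in\mathcal{X}} \varrho(xa) \le \xi(x) - \sum_{a\in\mathcal{X}} \xi(xa) \quad\text{and}$$
$$\text{(ii)}\quad \varrho(x) - \sum_{a\in\mathcal{X}} \varrho^x(xa) \le \xi(x) - \sum_{a\in\mathcal{X}} \xi(xa)$$

for all $x \in \mathcal{X}^*$. In particular, $\xi - \varrho$ is a semimeasure.

Proof. For all $x \in \mathcal{X}^*$, with $f := \xi - \varrho$ we have

$$\begin{aligned}
\sum_{a\in\mathcal{X}} f(xa) &= \sum_{a\in\mathcal{X}} \big(\xi(xa) - \varrho(xa)\big) \le \sum_{a\in\mathcal{X}} \big(\xi(xa) - \varrho^x(xa)\big) \\
&= \sum_{\nu\in\mathcal{C}\setminus\{\nu^x\}} \sum_{a\in\mathcal{X}} w_\nu \nu(xa) \le \sum_{\nu\in\mathcal{C}\setminus\{\nu^x\}} w_\nu \nu(x) = \xi(x) - \varrho(x) = f(x).
\end{aligned}$$

The first inequality follows from $\varrho^x(xa) \le \varrho(xa)$, and the second one
holds since all $\nu$ are semimeasures. Finally,
$f(x) = \xi(x) - \varrho(x) = \sum_{\nu\in\mathcal{C}\setminus\{\nu^x\}} w_\nu \nu(x) \ge 0$
and $f(\epsilon) = \xi(\epsilon) - \varrho(\epsilon) \le 1$. Hence $f$ is a
semimeasure. □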
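
The statement of Lemma 4 can be checked by brute force on short strings. The following is a minimal sketch, assuming a hypothetical class of three Bernoulli measures with illustrative weights; `leak` is a helper name introduced here for the difference between a function's value at a string and the sum of its values at the one-symbol continuations.

```python
import itertools

# Hypothetical class of three Bernoulli measures, theta = P(symbol 1).
THETAS = [0.1, 0.4, 0.7]
WEIGHTS = [0.3, 0.3, 0.4]   # illustrative prior weights


def nu(theta, x):
    """nu_theta(x) for a binary tuple x."""
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)


def xi(x):
    """Bayes mixture."""
    return sum(w * nu(th, x) for th, w in zip(THETAS, WEIGHTS))


def rho(x):
    """Two-part MDL estimator."""
    return max(w * nu(th, x) for th, w in zip(THETAS, WEIGHTS))


def diff(x):
    """The function xi - rho examined in the lemma."""
    return xi(x) - rho(x)


def leak(f, x):
    """f(x) minus the sum of f over the one-symbol continuations of x."""
    return f(x) - f(x + (0,)) - f(x + (1,))


strings = [x for n in range(7) for x in itertools.product((0, 1), repeat=n)]
worst_leak = min(leak(diff, x) for x in strings)
print(worst_leak)   # non-negative up to rounding if xi - rho is a semimeasure
```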

The following proposition demonstrates how simple it can be to obtain a convergence
result, albeit a weak one. Various similar results have already been obtained
in the past, e.g. in [BD62, Bar85].

Proposition 5 For any class of (semi)measures $\mathcal{C}$ containing the true
distribution $\mu$, we have

$$\frac{\varrho(x_t|x_{<t})}{\mu(x_t|x_{<t})} \xrightarrow{t\to\infty} 1 \quad\text{w.}\mu\text{.p.1}.$$

Proof. Since $\xi - \varrho$ is a positive semimeasure by Lemma 4,
$\frac{\xi-\varrho}{\mu}$ is a positive super-martingale. By Doob's martingale
convergence theorem (see e.g. [Doo53] or [CT88] or any textbook on advanced
probability theory), it therefore converges on a set of $\mu$-measure one. Moreover,
$\frac{\xi}{\mu}$ converges on a set of measure one, being a positive
super-martingale as well [LV97, Thm. 5.2.2]. Thus $\frac{\varrho}{\mu}$ must converge
on a set of measure one. We denote this limit by $f$ and observe that $f \ge w_\mu$
since $\frac{\varrho}{\mu} \ge w_\mu$ everywhere. On this set of measure one, the
denominator $\varrho(x_{<t})/\mu(x_{<t})$ of

$$\frac{\varrho(x_{1:t})/\mu(x_{1:t})}{\varrho(x_{<t})/\mu(x_{<t})} = \frac{\varrho(x_t|x_{<t})}{\mu(x_t|x_{<t})}$$

converges to $f > 0$, and so does the numerator. The whole fraction thus converges
to one, which was to be shown. □

Proposition 5 gives only a statement about "on-sequence" ($\varrho(x_t|x_{<t})$)
convergence of the $\varrho$-predictions. Indeed, no conclusion about "off-sequence"
convergence, i.e. $\varrho(a|x_{<t})$ for arbitrary $a \in \mathcal{X}$, can be drawn
from the proposition, not even in the deterministic case. There, the true measure
$\mu$ is concentrated on the particular sequence $x_{<\infty}$. So for $a \ne x_t$, we
have $\mu(x_{<t}a) = 0$, and thus no assertion for $\varrho(a|x_{<t})$ can be made. On
the other hand, an off-sequence result is essential for prediction: Even if on the
correct next symbol the predictive probability is very close to the true value, we
must be sure that this is so also for all alternatives. This is particularly
important if we base some decision on the prediction; compare Section 8.

The following theorem closes this gap. In addition, it provides a statement
about the speed of convergence. In order to prove it, we need a lemma establishing
a relation between the square distance and the Kullback-Leibler distance, which is
proven for instance in [Hut04, Sec. 3.9.2].

Lemma 6 Let $\mu$ and $\rho$ be measures on $\mathcal{X}$, then

$$\sum_{a\in\mathcal{X}} \big(\mu(a) - \rho(a)\big)^2 \le \sum_{a\in\mathcal{X}} \mu(a) \ln\frac{\mu(a)}{\rho(a)}.$$

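Theorem 7 For any class of (semi)measures $\mathcal{C}$ containing the true
distribution $\mu$ (which is a measure), we have

$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\mu(a|x_{<t}) - \varrho_{\mathrm{norm}}(a|x_{<t})\big)^2 \le w_\mu^{-1} + \ln w_\mu^{-1}.$$

That is, $\varrho_{\mathrm{norm}}(a|x_{<t}) \xrightarrow{\text{i.m.s.}} \mu(a|x_{<t})$
(see (7)), which implies $\varrho_{\mathrm{norm}}(a|x_{<t}) \to \mu(a|x_{<t})$
with $\mu$-probability one.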

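Proof. Let $n \in \mathbb{N}$. From Lemma 6, we know

$$\begin{aligned}
\sum_{t=1}^{n} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\mu(a|x_{<t}) - \varrho_{\mathrm{norm}}(a|x_{<t})\big)^2
&\le \sum_{t=1}^{n} \mathbf{E} \sum_{a\in\mathcal{X}} \mu(a|x_{<t}) \ln\frac{\mu(a|x_{<t})}{\varrho_{\mathrm{norm}}(a|x_{<t})} \\
&= \sum_{t=1}^{n} \mathbf{E} \ln\frac{\mu(x_t|x_{<t})}{\varrho_{\mathrm{norm}}(x_t|x_{<t})}
 = \sum_{t=1}^{n} \mathbf{E} \left[ \ln\frac{\mu(x_t|x_{<t})}{\varrho(x_t|x_{<t})} + \ln\frac{\sum_{a\in\mathcal{X}} \varrho(x_{<t}a)}{\varrho(x_{<t})} \right].
\end{aligned} \tag{14}$$

Then we can estimate

$$\sum_{t=1}^{n} \mathbf{E} \ln\frac{\mu(x_t|x_{<t})}{\varrho(x_t|x_{<t})}
= \mathbf{E} \ln \prod_{t=1}^{n} \frac{\mu(x_t|x_{<t})}{\varrho(x_t|x_{<t})}
= \mathbf{E} \ln\frac{\varrho(\epsilon)\,\mu(x_{1:n})}{\varrho(x_{1:n})} \le \ln w_\mu^{-1}, \tag{15}$$

since always $\frac{\mu}{\varrho} \le w_\mu^{-1}$ and $\varrho(\epsilon) \le 1$.
Moreover, by setting $x = x_{<t}$, using $\ln u \le u - 1$, adding an always positive
max-term, and finally using $\frac{\mu}{\varrho} \le w_\mu^{-1}$ again, we obtain

$$\begin{aligned}
\mathbf{E} \ln\frac{\sum_{a\in\mathcal{X}} \varrho(x_{<t}a)}{\varrho(x_{<t})}
&\le \mathbf{E} \left[ \frac{\sum_{a\in\mathcal{X}} \varrho(xa)}{\varrho(x)} - 1 \right]
 = \sum_{\ell(x)=t-1} \frac{\mu(x) \big[ \sum_{a\in\mathcal{X}} \varrho(xa) - \varrho(x) \big]}{\varrho(x)} \\
&\le \sum_{\ell(x)=t-1} \frac{\mu(x) \big[ \big(\sum_{a\in\mathcal{X}} \varrho(xa)\big) - \varrho(x) + \max\{0,\ \varrho(x) - \sum_{a\in\mathcal{X}} \varrho(xa)\} \big]}{\varrho(x)} \\
&\le w_\mu^{-1} \sum_{\ell(x)=t-1} \Big[ \Big(\sum_{a\in\mathcal{X}} \varrho(xa)\Big) - \varrho(x) + \max\Big\{0,\ \varrho(x) - \sum_{a\in\mathcal{X}} \varrho(xa)\Big\} \Big].
\end{aligned} \tag{16}$$

If $\mathcal{C}$ contains only measures, the max-term is not necessary, since
$\varrho$ is an "anti-semimeasure" in this case. We proceed by observing

$$\sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[ \Big(\sum_{a\in\mathcal{X}} \varrho(xa)\Big) - \varrho(x) \Big]
= \sum_{t=1}^{n} \Big[ \sum_{\ell(x)=t} \varrho(x) - \sum_{\ell(x)=t-1} \varrho(x) \Big]
= \Big[ \sum_{\ell(x)=n} \varrho(x) \Big] - \varrho(\epsilon), \tag{17}$$

which is true since for successive $t$ the positive and negative terms cancel. From
Lemma 4 we know
$\varrho(x) - \sum_{a\in\mathcal{X}} \varrho(xa) \le \xi(x) - \sum_{a\in\mathcal{X}} \xi(xa)$
and therefore

$$\begin{aligned}
\sum_{t=1}^{n} \sum_{\ell(x)=t-1} \max\Big\{0,\ \varrho(x) - \sum_{a\in\mathcal{X}} \varrho(xa)\Big\}
&\le \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \max\Big\{0,\ \xi(x) - \sum_{a\in\mathcal{X}} \xi(xa)\Big\} \\
&= \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[ \xi(x) - \sum_{a\in\mathcal{X}} \xi(xa) \Big]
 = \xi(\epsilon) - \sum_{\ell(x)=n} \xi(x).
\end{aligned} \tag{18}$$

Here we have again used the fact that positive and negative terms cancel for
successive $t$, and moreover the fact that $\xi$ is a semimeasure. Combining (16),
(17) and (18), and observing $\varrho \le \xi \le 1$, we obtain

$$\sum_{t=1}^{n} \mathbf{E} \ln\frac{\sum_{a\in\mathcal{X}} \varrho(x_{<t}a)}{\varrho(x_{<t})}
\le w_\mu^{-1} \Big[ \xi(\epsilon) - \varrho(\epsilon) + \sum_{\ell(x)=n} \big(\varrho(x) - \xi(x)\big) \Big]
\le w_\mu^{-1} \xi(\epsilon) \le w_\mu^{-1}. \tag{19}$$

Therefore, (14), (15) and (19) finally prove the assertion. □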

We point out again that the proof gets a bit simpler if $\mathcal{C}$ contains only
measures, since then (18) becomes irrelevant. However, this case doesn't give a
tighter bound.

This is the first convergence result "in mean sum", see (7). It implies both
on-sequence and off-sequence convergence. Moreover, it asserts that the convergence
is "fast" in the sense that the sum of the total expected deviations is bounded by
$w_\mu^{-1} + \ln w_\mu^{-1}$. Of course, $w_\mu^{-1}$ can be very large, namely
$w_\mu^{-1} = 2^{Kw(\mu)}$. The following example will show that this bound is sharp
(save for a constant factor). Observe that in the corresponding result for mixtures,
Theorem 1, the bound is much smaller, namely $\ln w_\mu^{-1} = Kw(\mu) \ln 2$.

Example 8 Let $\mathcal{X} = \{0,1\}$, $N \ge 1$ and
$\mathcal{C} = \{\nu_1, \ldots, \nu_{N-1}, \mu\}$. Each $\nu_i$ is a deterministic
measure concentrated on the sequence $1^{i-1}0^\infty$, while the true distribution
$\mu$ is deterministic and concentrated on $x_{<\infty} = 1^\infty$. Let
$w_{\nu_i} = w_\mu = \frac{1}{N}$ for all $i$. Then $\mu$ generates $x_{<\infty}$, and
for each $t \le N-1$ we have
$\varrho_{\mathrm{norm}}(0|x_{<t}) = \varrho_{\mathrm{norm}}(1|x_{<t}) = \frac{1}{2}$.
Hence, $\sum_t \mathbf{E} \sum_a (\mu(a|x_{<t}) - \varrho_{\mathrm{norm}}(a|x_{<t}))^2
= \frac{1}{2}(N-1) \approx \frac{1}{2} w_\mu^{-1}$. In Example 15 we will even see a
case where the model class contains only Bernoulli distributions and nevertheless the
exponential bound is sharp.
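
The computation in Example 8 can be reproduced directly. The following is a minimal sketch that builds the class of the example for a given $N$ and sums the squared errors of the normalized dynamic MDL predictions along the sequence $1^\infty$; the function name `example8_total_error` and the finite cutoff of $2N$ steps are choices made here for illustration.

```python
def example8_total_error(N):
    """Total squared error of normalized dynamic MDL on the all-ones sequence."""
    w = 1.0 / N   # uniform prior weight of every class element

    def nu_i(i, x):
        """Deterministic measure concentrated on 1^(i-1) followed by zeros."""
        target = [1] * (i - 1) + [0] * max(0, len(x) - (i - 1))
        return 1.0 if x == target[:len(x)] else 0.0

    def mu(x):
        """True deterministic measure concentrated on the all-ones sequence."""
        return 1.0 if all(s == 1 for s in x) else 0.0

    def rho(x):
        """Two-part MDL estimator over the class {nu_1, ..., nu_(N-1), mu}."""
        return max([w * mu(x)] + [w * nu_i(i, x) for i in range(1, N)])

    total = 0.0
    for t in range(1, 2 * N + 1):      # no further errors occur after t = N - 1
        x = [1] * (t - 1)              # history generated by mu
        r0, r1 = rho(x + [0]), rho(x + [1])
        p1 = r1 / (r0 + r1)            # normalized prediction of symbol 1
        # mu predicts symbol 1 with probability one and symbol 0 with zero.
        total += (1.0 - p1) ** 2 + (0.0 - (1.0 - p1)) ** 2
    return total


print(example8_total_error(10))
```

Because the true measure is deterministic, the expectation reduces to a single path, so the loop computes the expected cumulative error exactly.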

The next result implies that convergence also holds for the un-normalized
dynamic MDL predictor.

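Theorem 9 For any class of (semi)measures $\mathcal{C}$ containing the true
distribution $\mu$, we have

$$\text{(i)}\quad \sum_{t=1}^{\infty} \mathbf{E} \Big| \ln \sum_{a\in\mathcal{X}} \varrho(a|x_{<t}) \Big| \le 2 w_\mu^{-1} \quad\text{and}$$
$$\text{(ii)}\quad \sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big| \varrho_{\mathrm{norm}}(a|x_{<t}) - \varrho(a|x_{<t}) \big| = \sum_{t=1}^{\infty} \mathbf{E} \Big| 1 - \sum_{a\in\mathcal{X}} \varrho(a|x_{<t}) \Big| \le 2 w_\mu^{-1}.$$

Proof. (i) Define $u^+ = \max\{0, u\}$ for $u \in \mathbb{R}$, then for
$x := x_{<t} \in \mathcal{X}^{t-1}$ we have

$$\begin{aligned}
\mathbf{E} \Big| \ln \sum_{a\in\mathcal{X}} \varrho(a|x) \Big|
&= \mathbf{E} \Big[ \ln\frac{\sum_a \varrho(xa)}{\varrho(x)} \Big]^+ + \mathbf{E} \Big[ \ln\frac{\varrho(x)}{\sum_a \varrho(xa)} \Big]^+ \\
&\le \mathbf{E}\, \frac{\big(\sum_a \varrho(xa) - \varrho(x)\big)^+}{\varrho(x)} + \mathbf{E}\, \frac{\big(\varrho(x) - \sum_a \varrho(xa)\big)^+}{\sum_a \varrho(xa)} \\
&= \sum_{\ell(x)=t-1} \frac{\mu(x) \big(\sum_a \varrho(xa) - \varrho(x)\big)^+}{\varrho(x)} + \sum_{\ell(x)=t-1} \frac{\mu(x) \big(\varrho(x) - \sum_a \varrho(xa)\big)^+}{\sum_a \varrho(xa)} \\
&\le w_\mu^{-1} \sum_{\ell(x)=t-1} \Big(\sum_a \varrho(xa) - \varrho(x)\Big)^+ + w_\mu^{-1} \sum_{\ell(x)=t-1} \Big(\varrho(x) - \sum_a \varrho(xa)\Big)^+ \\
&= w_\mu^{-1} \sum_{\ell(x)=t-1} \Big| \varrho(x) - \sum_a \varrho(xa) \Big|
 = w_\mu^{-1} \sum_{\ell(x)=t-1} \Big[ \sum_a \varrho(xa) - \varrho(x) + 2 \Big(\varrho(x) - \sum_a \varrho(xa)\Big)^+ \Big].
\end{aligned}$$

Here, $|u| = u^+ + (-u)^+ = -u + 2u^+$, $\ln u \le u - 1$, and $\varrho \ge w_\mu \mu$
have been used, the latter implies also
$\sum_a \varrho(xa) \ge w_\mu \sum_a \mu(xa) = w_\mu \mu(x)$. The last expression in
this (in)equality chain, when summed over $t = 1 \ldots \infty$, is bounded by
$2 w_\mu^{-1}$ by essentially the same arguments (16)-(19) as in the proof of
Theorem 7.

(ii) Let again $x := x_{<t}$ and use
$\varrho_{\mathrm{norm}}(a|x) = \varrho(a|x) / \sum_b \varrho(b|x)$ to obtain

$$\begin{aligned}
\sum_{a\in\mathcal{X}} \big| \varrho_{\mathrm{norm}}(a|x) - \varrho(a|x) \big|
&= \sum_{a\in\mathcal{X}} \frac{\varrho(a|x)}{\sum_b \varrho(b|x)} \Big| 1 - \sum_b \varrho(b|x) \Big| = \Big| 1 - \sum_b \varrho(b|x) \Big| \\
&= \frac{\big(\sum_a \varrho(xa) - \varrho(x)\big)^+}{\varrho(x)} + \frac{\big(\varrho(x) - \sum_a \varrho(xa)\big)^+}{\varrho(x)}.
\end{aligned} \tag{20}$$

Then take the expectation $\mathbf{E}$ and the sum $\sum_{t=1}^{\infty}$ and proceed
as in (i). □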
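Corollary 10 For any class of (semi)measures $\mathcal{C}$ containing the true
distribution $\mu$, we have

$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\mu(a|x_{<t}) - \varrho(a|x_{<t})\big)^2 \le 8 w_\mu^{-1}.$$

That is, $\varrho(a|x_{<t}) \xrightarrow{\text{i.m.s.}} \mu(a|x_{<t})$ (see (7)).

Proof. For two functions $\varphi_1, \varphi_2 : \mathcal{X}^* \to [0,1]$, let

$$\Delta(\varphi_1, \varphi_2) = \Big( \sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\varphi_1(a|x_{<t}) - \varphi_2(a|x_{<t})\big)^2 \Big)^{\frac{1}{2}}. \tag{21}$$

Then the triangle inequality holds for $\Delta(\cdot,\cdot)$, since $\Delta$ is
(proportional to) a Euclidean distance (2-norm). Moreover,
$\Delta(\mu, \varrho_{\mathrm{norm}}) \le \sqrt{2 w_\mu^{-1}}$ by Theorem 7 and
$\ln w_\mu^{-1} \le w_\mu^{-1} - 1 \le w_\mu^{-1}$. We also have
$\Delta(\varrho_{\mathrm{norm}}, \varrho) \le \sqrt{2 w_\mu^{-1}}$ by multiplying
$|\varrho_{\mathrm{norm}} - \varrho|$ in Theorem 9 (ii) with another
$|\varrho_{\mathrm{norm}} - \varrho|$. Note $|\varrho_{\mathrm{norm}} - \varrho| \le 1$,
since both $\varrho(a|x), \varrho_{\mathrm{norm}}(a|x) \in [0,1]$; for $\varrho$ this
holds by (12). This implies
$\Delta(\mu, \varrho) \le \Delta(\mu, \varrho_{\mathrm{norm}}) + \Delta(\varrho_{\mathrm{norm}}, \varrho) \le 2\sqrt{2 w_\mu^{-1}}$. □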

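Corollary 11 For almost all $x_{<\infty} \in \mathcal{X}^\infty$, the normalizer
$N_\varrho$ defined in (13) converges to a number which is finite and greater than
zero, i.e. $0 < N_\varrho(x_{<\infty}) < \infty$. Moreover, the sum of the MDL
posterior estimates converges to one almost surely,

$$\sum_{a\in\mathcal{X}} \varrho(a|x_{<t}) = \frac{\sum_{a\in\mathcal{X}} \varrho(x_{<t}a)}{\varrho(x_{<t})} \to 1 \quad\text{as } t \to \infty \quad\text{w.}\mu\text{.p.1}. \tag{22}$$

Proof. Theorem 9 (i) implies that with probability one, the sum
$\sum_{t=1}^{n} \big| \ln \sum_a \varrho(a|x_{<t}) \big|$ is bounded in $n$, hence
converges absolutely, hence also the limit
$\ln N_\varrho(x_{<\infty}) = \sum_{t=1}^{\infty} \ln \sum_a \varrho(a|x_{<t})$
exists and is finite. For these sequences, $0 < N_\varrho(x_{<\infty}) < \infty$ and
(22) follows. □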



5 Static MDL 

Static MDL as introduced in Definition 3 is usually more efficient and thus preferred
in practice, since only one MDL estimator has to be computed. The following
technical result will allow us to conclude that the static MDL predictions converge in
mean sum like the dynamic ones.
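Theorem 12 For any class of (semi)measures $\mathcal{C}$ containing the true
distribution $\mu$, we have

$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big| \varrho^{\mathrm{static}}_{\mathrm{norm}}(a|x_{<t}) - \varrho^{x_{<t}}(a|x_{<t}) \big| = \sum_{t=1}^{\infty} \mathbf{E} \Big| 1 - \sum_{a\in\mathcal{X}} \varrho^{x_{<t}}(a|x_{<t}) \Big| \le w_\mu^{-1}.$$

Proof. We proceed in a similar way as in the proof of Theorem 7, (16)-(18). From
Lemma 4, we know
$\varrho(x) - \sum_a \varrho^x(xa) \le \xi(x) - \sum_a \xi(xa)$. Then

$$\begin{aligned}
\sum_{t=1}^{n} \mathbf{E} \Big| 1 - \sum_{a\in\mathcal{X}} \varrho^{x_{<t}}(a|x_{<t}) \Big|
&= \sum_{t=1}^{n} \mathbf{E}\, \frac{\varrho(x_{<t}) - \sum_a \varrho^{x_{<t}}(x_{<t}a)}{\varrho(x_{<t})}
 = \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \mu(x)\, \frac{\varrho(x) - \sum_a \varrho^x(xa)}{\varrho(x)} \\
&\le w_\mu^{-1} \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[ \varrho(x) - \sum_a \varrho^x(xa) \Big]
 \le w_\mu^{-1} \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[ \xi(x) - \sum_a \xi(xa) \Big] \\
&= w_\mu^{-1} \Big[ \xi(\epsilon) - \sum_{\ell(x)=n} \xi(x) \Big] \le w_\mu^{-1}
\end{aligned}$$

for all $n \in \mathbb{N}$. This implies the assertion. Again we have used
$\frac{\mu}{\varrho} \le w_\mu^{-1}$ and the fact that positive and negative terms
cancel for successive $t$. □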

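Corollary 13 For any class of (semi)measures $\mathcal{C}$ containing the true
distribution $\mu$, we have

$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\mu(a|x_{<t}) - \varrho^{x_{<t}}(a|x_{<t})\big)^2 \le 21 w_\mu^{-1} \quad\text{and}$$
$$\sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\mu(a|x_{<t}) - \varrho^{\mathrm{static}}_{\mathrm{norm}}(a|x_{<t})\big)^2 \le 32 w_\mu^{-1}.$$

That is, $\varrho^{x_{<t}}(a|x_{<t}) \xrightarrow{\text{i.m.s.}} \mu(a|x_{<t})$ and
$\varrho^{\mathrm{static}}_{\mathrm{norm}}(a|x_{<t}) \xrightarrow{\text{i.m.s.}} \mu(a|x_{<t})$.

Proof. Using $\varrho(xa) \ge \varrho^x(xa)$ and the triangle inequality, we see

$$\sum_{a} \big| \varrho(a|x) - \varrho^x(a|x) \big| = \Big| \sum_a \varrho(a|x) - \sum_a \varrho^x(a|x) \Big| \le \Big| \sum_a \varrho(a|x) - 1 \Big| + \Big| 1 - \sum_a \varrho^x(a|x) \Big|.$$

With $\Delta(\cdot,\cdot)$ as in (21), using $|\varrho - \varrho^x| \le 1$ we
therefore have

$$\Delta^2(\varrho, \varrho^{\mathrm{static}}) \le \sum_{t=1}^{\infty} \mathbf{E} \sum_{a} \big| \varrho(a|x_{<t}) - \varrho^{x_{<t}}(a|x_{<t}) \big| \le 3 w_\mu^{-1}$$

according to Theorem 9 (ii) and Theorem 12. Since
$\Delta(\mu, \varrho) \le \sqrt{8 w_\mu^{-1}}$ holds by Corollary 10, we obtain
$\Delta(\mu, \varrho^{\mathrm{static}}) \le \Delta(\mu, \varrho) + \Delta(\varrho, \varrho^{\mathrm{static}}) \le \sqrt{21 w_\mu^{-1}}$.
Theorem 12 also asserts
$\Delta(\varrho^{\mathrm{static}}, \varrho^{\mathrm{static}}_{\mathrm{norm}}) \le \sqrt{w_\mu^{-1}}$,
hence $\Delta(\mu, \varrho^{\mathrm{static}}_{\mathrm{norm}}) \le \sqrt{32 w_\mu^{-1}}$
follows. □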

Distance measures. The total expected square error is not the only possible
choice for measuring distance of distributions and speed of convergence. In fact,
looking at the proof of Theorem 7, the expected Kullback-Leibler distance may
seem more natural at first glance. However, this quantity behaves well only under
dynamic MDL, not static MDL. To see this, let $\mathcal{C} = \{0, \frac{1}{2}\}$
contain two Bernoulli distributions, both with prior weight $\frac{1}{2}$, and let
$\mu = \frac{1}{2}$ be the uniform measure. If the first symbol happens to be 0, which
occurs with probability $\frac{1}{2}$, then the static MDL estimate is $\nu^0 = 0$.
Then $D(\mu\|\nu^0) = \infty$, hence the expectation is $\infty$, too.
The quadratic distance behaves locally like the Kullback-Leibler distance (Lemma
6), but otherwise is bounded and thus more convenient.
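
The two-element counterexample can be verified numerically. The following is a minimal sketch of that case, with the Bernoulli parameter read as the probability of symbol 1 (an interpretation assumed here); `static_prediction`, `kl` and `squared` are names introduced for illustration.

```python
import math

THETAS = [0.0, 0.5]     # class of the example: P(symbol 1) is 0 or 1/2
WEIGHTS = [0.5, 0.5]    # both elements have prior weight 1/2


def nu(theta, x):
    """nu_theta(x) for a binary list x; 0.0 ** 0 evaluates to 1.0 in Python."""
    ones = sum(x)
    return theta ** ones * (1.0 - theta) ** (len(x) - ones)


def static_prediction(x):
    """Distribution (P(0), P(1)) predicted by the MAP element for history x."""
    theta = max(zip(THETAS, WEIGHTS), key=lambda tw: tw[1] * nu(tw[0], x))[0]
    return (1.0 - theta, theta)


def kl(p, q):
    """Kullback-Leibler divergence of p from q, possibly infinite."""
    total = 0.0
    for pa, qa in zip(p, q):
        if pa > 0.0:
            if qa == 0.0:
                return math.inf
            total += pa * math.log(pa / qa)
    return total


def squared(p, q):
    """Quadratic distance between two distributions on the same alphabet."""
    return sum((pa - qa) ** 2 for pa, qa in zip(p, q))


mu = (0.5, 0.5)                     # true next-symbol distribution
pred = static_prediction([0])       # static MDL prediction after observing 0
print(kl(mu, pred), squared(mu, pred))
```

After the single observation 0 the relative entropy is infinite while the quadratic distance stays bounded, which is the reason given in the text for preferring the latter.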
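Another possible choice is the Hellinger distance

$$h_t(\mu, \varphi)\big|_{x_{<t}} = \sum_{a\in\mathcal{X}} \Big( \sqrt{\mu(a|x_{<t})} - \sqrt{\varphi(a|x_{<t})} \Big)^2 \quad\text{and} \tag{23}$$
$$H_{1:n}(\mu, \varphi) = \sum_{t=1}^{n} \mathbf{E}\, h_t(\mu, \varphi). \tag{24}$$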

Like the square distance, the Hellinger distance is bounded by both the relative
entropy and the absolute distance:

$$h_t(\mu, \varphi) \le \sum_{a\in\mathcal{X}} \mu(a|x_{<t}) \ln\frac{\mu(a|x_{<t})}{\varphi(a|x_{<t})}, \tag{25}$$
$$h_t(\mu, \varphi) \le \sum_{a\in\mathcal{X}} \big| \mu(a|x_{<t}) - \varphi(a|x_{<t}) \big|. \tag{26}$$

The former is e.g. shown in [Hut04, Lem. 3.11, p. 114], the latter follows from
$(\sqrt{u} - \sqrt{v})^2 \le |u - v|$ for any $u, v \ge 0$. Therefore, the same bounds
we have proven for the square distance also hold for the Hellinger distance; they are
subsumed in Corollary 14 below. Although for simplicity of notation we have preferred
the square distance over the Hellinger distance in the presentation so far, in
Sections 8.1 and 8.3 we will meet situations where the quadratic distance is not
sufficient. Then the Hellinger distance will be useful.
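
The two bounds on the Hellinger distance can be spot-checked on a few distributions. The following is a minimal sketch with hand-picked pairs of probability vectors (the list `PAIRS` is an illustrative assumption, and the first vector of each pair plays the role of $\mu$).

```python
import math


def hellinger(p, q):
    """Sum of squared differences of square roots."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def relative_entropy(p, q):
    """Kullback-Leibler divergence of p from q, with 0 * ln(0/q) = 0."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0.0:
            if b == 0.0:
                return math.inf
            total += a * math.log(a / b)
    return total


def absolute(p, q):
    """Absolute (L1) distance."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical test pairs of probability vectors over the same alphabet.
PAIRS = [
    ((0.5, 0.5), (0.9, 0.1)),
    ((0.2, 0.3, 0.5), (0.4, 0.4, 0.2)),
    ((0.99, 0.01), (0.5, 0.5)),
    ((1.0, 0.0), (0.7, 0.3)),
]

for p, q in PAIRS:
    print(hellinger(p, q), relative_entropy(p, q), absolute(p, q))
```

A check on finitely many pairs is of course no proof; it only illustrates the ordering of the three quantities.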

The following corollary recapitulates our results and states convergence i.m.s.
(and therefore also w.$\mu$.p.1) for all combinations of un-normalized/normalized and
dynamic/static MDL predictions.
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Corollary 14 Let C contain the true distribution fi, then 

*S , <oo(/ i ) f?norm) — 2l<J , H <00 ( K fl, £> n orm) ^ 2li>^ , 

S <O0 (ji, q) < 8w~\ H <00 (n, g) < 8w~\ 
S <oa (ji, Q static ) < 21w~\ H <a0 (ji, g st * tic ) < 21w-\ 

5<oo(/i, e££) < S2w;\ H <00 (», < 32 W ;\ 

where S <0O (fi, if) = E (/i(a|x< t ) — y?(a|x< t )) 2 and -ff <00 as m 

The following example shows that the exponential bound is sharp (except for 
a multiplicative constant), even if the model class contains only Bernoulli distribu- 
tions. It is stated in terms of static MDL, however it equally holds for dynamic 
MDL. 

Example 15 Let $N > 1$ and $\mathcal{C} = \Theta = \{\tfrac12\} \cup \{\tfrac12 + 2^{-k-1} : 1 \le k \le N\}$ be a Bernoulli
class as discussed at the end of Sectional. Let $\mu$ be Bernoulli with parameter $\tfrac12$, i.e.
the distribution generating fair coin flips. Assume that all weights are equal to $\tfrac{1}{N+1}$.
Then it is shown in [PH04b, Prop. 5] that



J>(|-^(l|x <t )) 2 >^(iV-4). 



t=i 



So the bound equals $w_\mu^{-1}$ within a multiplicative constant.

This shows that in general there is no hope of improving the bounds, even for
very simple model classes. But the situation is not as bad as it might seem. First,
the bounds may be exponentially smaller under certain regularity conditions on
the class and the weights, as [Ris96] and the positive assertions in [PH04b] show.
It is open to define such conditions for more general model classes. Second, the
example just given behaves differently than Example |S|. There, the error remains at
a significant level for $O(w_\mu^{-1})$ time steps, which must be regarded critical. Here in
contrast, the error drops to zero as - for a very long time, namely 0(2 W ^ ) steps, 
and decreases more rapidly only afterwards. This behavior is tolerable in practice. 
Recently, [Li99, Zha04] have proven that this favorable case always occurs for i.i.d.,
if the weights satisfy the light tails condition $\sum_\nu w_\nu^{\alpha} \le 1$ for some $\alpha < 1$ jBTM].
Precisely, they give a rapidly decaying bound on the instantaneous error. It is open
if similar results also hold in more general setups than i.i.d. Example |S1 shows that
at least some additional assumption is necessary.



6 Hybrid MDL 

So far, we have not cared about what happens if two or more (semi)measures obtain
the same value $w_\nu\nu(x)$ for some string $x$. In fact, for the previous results, the
tie-breaking strategy can be completely arbitrary. This need not be so for all thinkable
prediction methods, as we will see with the hybrid MDL predictor in the subsequent
example.



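Definition 16 The hybrid MDL predictor is given by

\[ \varrho^{\mathrm{hybrid}}(a|x) = \frac{\nu^{xa}(xa)}{\nu^{x}(x)}, \]

where $\nu^{x}$ denotes the maximizing element for the string $x$ (compare (J1UD )). This can be paraphrased as "do dynamic MDL and drop the
weights". It is somewhat in-between static and dynamic MDL.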

Example 17 Let $\mathcal{X} = \{0, 1\}$ and $\mathcal{C}$ contain only two measures, the uniform measure
$\lambda$ which is defined by $\lambda(x) = 2^{-\ell(x)}$, and another measure $\nu$ having $\nu(1x) = 2^{-\ell(x)}$
and $\nu(0x) = 0$. The respective weights are $w_\lambda = \tfrac23$ and $w_\nu = \tfrac13$. Then, for each
$x$ starting with 1, we have $w_\nu\nu(x) = w_\lambda\lambda(x) = \tfrac13 2^{-\ell(x)+1}$. Therefore, for all $x_{<\infty}$
starting with 1 (a set which has uniform measure $\tfrac12$), we have a tie. If the maximizing
element $\nu^{*}$ is chosen to be $\lambda$ for even $t$ and $\nu$ for odd $t$, then both static and dynamic
MDL predict probabilities of constantly

\[ \frac12 = \lambda(a|x_{<t}) = \nu(a|x_{<t}) = \frac{w_\lambda\lambda(x_{<t}a)}{w_\nu\nu(x_{<t})} = \frac{w_\nu\nu(x_{<t}a)}{w_\lambda\lambda(x_{<t})} \]

for all $a \in \{0, 1\}$. However, the hybrid MDL predictor values $\nu^{x_{1:t}}(x_{1:t})/\nu^{x_{<t}}(x_{<t})$ oscillate
between $\tfrac14$ and 1.
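The oscillation can be reproduced with exact rational arithmetic. The sketch below makes several assumptions that are not fixed by the text: strings are Python `str` objects over "0" and "1", the weights are taken as 2/3 and 1/3 (any pair with $w_\lambda = 2w_\nu$ produces the same ties), and the tie-breaking parity is decided by the length of the string. All function names are hypothetical.

```python
from fractions import Fraction

# Assumed weights; only the relation W_LAMBDA == 2 * W_NU matters for the ties.
W_LAMBDA = Fraction(2, 3)
W_NU = Fraction(1, 3)


def lam(x):
    """Uniform measure: lambda(x) = 2^(-len(x))."""
    return Fraction(1, 2 ** len(x))


def nu(x):
    """nu(1x') = 2^(-len(x')), nu(0x') = 0."""
    if x == "":
        return Fraction(1)
    if x[0] == "0":
        return Fraction(0)
    return Fraction(1, 2 ** (len(x) - 1))


def maximizer(x):
    """Maximizing element for x; ties go to lambda on even length and nu on odd length."""
    a, b = W_LAMBDA * lam(x), W_NU * nu(x)
    if a != b:
        return lam if a > b else nu
    return lam if len(x) % 2 == 0 else nu


def dynamic(a, x):
    """Dynamic MDL: ratio of the maximal weighted probabilities."""
    rho = lambda y: max(W_LAMBDA * lam(y), W_NU * nu(y))
    return rho(x + a) / rho(x)


def static(a, x):
    """Static MDL: predict with the element that is maximizing for x."""
    m = maximizer(x)
    return m(x + a) / m(x)


def hybrid(a, x):
    """Hybrid MDL: dynamic MDL with the weights dropped."""
    return maximizer(x + a)(x + a) / maximizer(x)(x)


histories = ["1", "11", "111", "1111"]
for x in histories:
    print(x, dynamic("1", x), static("1", x), hybrid("1", x))
```

With this tie-breaking rule the static and dynamic columns stay at one half, while the hybrid column alternates between the two values named in the example.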

If the ambiguity in the tie-breaking process is removed, e.g. in favor of larger
weights, then the hybrid MDL predictor does converge for this example. We replace
(dnj) by this rule:

\[ \nu^{x} = \arg\max\big\{ w_\nu : \nu \in \arg\max_{\nu'\in\mathcal{C}} w_{\nu'}\nu'(x) \big\}. \]

Then, do the hybrid MDL predictions always converge? This is equivalent to asking
if the process of selecting a maximizing element eventually stabilizes. If stabilization
does not occur, then hybrid MDL will necessarily fail as soon as the weights are not
equal. A possible counterexample could consist of two measures whose quotient
oscillates perpetually around a certain value. We show that this can indeed
happen, even for different reasons.

Example 18 Let X be binary, jj(x) = nflfi fa&i) an d = Yli=i u i( x i) with 

^(1) = 1 - 2- 2 Tfl and ^(1) = 1 - 2- 2 ^l +1 . 

Then one can easily see that . . .) = YIT > 0' . . .) = YIT u iW > ®i 

and ^Mjj^ converges and oscillates. In fact, each sequence having positive measure 
under /i and v contains eventually only ones, and the quotient oscillates. 
Figure 1: Construction of a martingale that with high probability converges to $\tfrac12$
oscillating infinitely often.



Example 19 This example is a little more complex. We assume the uniform
distribution $\lambda$ to be the true distribution. We now construct a positive martingale $f(\cdot)$
that converges to $\tfrac12$ with high probability and thereby oscillates infinitely often.

The martingale is defined on strings $x$ of successively increasing length. Of
course, $f(\epsilon) := 1$. If $f(x)$ is already defined for strings of length $n - 1$, we extend
the definition to strings of length $n$ in the following way: If $f(x) > \tfrac12$, we set

\[ f(x0) := \tfrac12 - 2^{-n-2} \quad\text{and} \]

\[ f(x1) := 2f(x) - \big(\tfrac12 - 2^{-n-2}\big). \]

This guarantees the martingale property $f(x) = \tfrac12\big(f(x0) + f(x1)\big)$. If $f(x) \le \tfrac12$ and
$f(x) > \tfrac14 + 2^{-n-3}$, then we can similarly define

\[ f(x0) := 2f(x) - \big(\tfrac12 + 2^{-n-2}\big) \quad\text{and} \]

\[ f(x1) := \tfrac12 + 2^{-n-2}. \]

However, if $f(x) \le \tfrac14 + 2^{-n-3}$, we cannot proceed in this way, since $f$ must be
positive. Therefore, we set $f(x0) := f(x1) := f(x)$ in this case and call those $x$
"dead" strings. Strings that are not dead will be called "alive". A few steps of the
construction are shown in Figure 1. For example, it can be observed that the string
000 is dead, all other strings in the figure are alive.

It is obvious from the construction that $f(x_{1:t})$ is a martingale, it oscillates and
converges to $\tfrac12$ as $t \to \infty$ for all sequences $x_{<\infty}$ that always stay alive. The only
thing we must show is that many sequences in fact stay alive.
Claim 20 We have $\lambda(\{x_{<\infty} : \exists t \text{ such that } x_{1:t} \text{ is dead}\}) \le \tfrac14$.

Proof. After the $n$th step, i.e. when $f$ has been defined for strings of length $n$, $f(x)$
assumes the value

\[ a_n^0 = \tfrac12 - 2^{-n-2} \]

on a set of measure at most $\tfrac12$. In the next step $n + 1$, $f$ is defined to

\[ a_n^1 = \tfrac12 - 2^{-n-1}\big(1 + \tfrac14\big) \]

on half of the extended strings. Generally, in the $k$th next step, $f$ is defined to

\[ a_n^k = \tfrac12 - 2^{-n+k-2} \sum_{j=0}^{k} 2^{-2j} \]

on a $2^{-k}$ fraction of the extended strings.

The extended strings stay alive as long as $a_n^k > \tfrac14 + 2^{-n-k-3}$ holds. Some
elementary calculations show that this is equivalent to $k < n$. So precisely after $n + 1$
additional steps, a fraction of $2^{-n-1}$ of the extended strings die.

We already noted that for $A_n = \{x : \ell(x) = n \wedge f(x) = a_n^0\}$, we have $\lambda(A_n) \le \tfrac12$.
Thus,

\[ \lambda(\{x_{<\infty} : x_{1:n} \in A_n \text{ and } x_{1:2n+1} \text{ is dead}\}) \le 2^{-n-2}. \]

Hence, one can conclude

\[ \lambda(\{x_{<\infty} : \exists t \text{ such that } x_{1:t} \text{ is dead}\}) \le \sum_{n=1}^{\infty} 2^{-n-2} = \tfrac14, \]

which proves the claim. □
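The first steps of the construction in Example 19 can be replayed with exact fractions. This is a minimal sketch under stated assumptions: strings are Python `str` objects, the function name `build_martingale` is hypothetical, and a string is recorded as dead when its value was copied from its parent (which is how the remark about the string 000 is read here). It checks the martingale property and positivity, and follows one alive path to show the oscillation around one half.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)
QUARTER = Fraction(1, 4)


def build_martingale(depth):
    """Return (f, dead): f maps binary strings up to `depth` to martingale values."""
    f = {"": Fraction(1)}
    dead = set()
    for n in range(1, depth + 1):
        eps = Fraction(1, 2 ** (n + 2))  # 2^(-n-2)
        for bits in product("01", repeat=n - 1):
            x = "".join(bits)
            fx = f[x]
            if x in dead or fx <= QUARTER + eps / 2:
                # Cannot continue without losing positivity: copy the value.
                f[x + "0"] = f[x + "1"] = fx
                dead.update({x + "0", x + "1"})
            elif fx > HALF:
                f[x + "0"] = HALF - eps
                f[x + "1"] = 2 * fx - f[x + "0"]
            else:
                f[x + "1"] = HALF + eps
                f[x + "0"] = 2 * fx - f[x + "1"]
    return f, dead


DEPTH = 6
f, dead = build_martingale(DEPTH)

# Martingale property under the uniform measure, and positivity.
is_martingale = all(f[x] == (f[x + "0"] + f[x + "1"]) / 2 for x in f if len(x) < DEPTH)
is_positive = all(value > 0 for value in f.values())

# One alive path that crosses 1/2 at every step.
path_values = [f[x] for x in ("0", "01", "010", "0101")]
print(is_martingale, is_positive, path_values)
```

Exact `Fraction` arithmetic is used so that the threshold comparisons are not affected by rounding.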
We now define the measure $\nu$ by

\[ \nu(x) = f(x) \cdot \lambda(x) = f(x) \cdot 2^{-\ell(x)}, \]

and set the weights to $w_\lambda = \tfrac13$ and $w_\nu = \tfrac23$. Then this provides an example where
the maximizing element never stops oscillating with probability at least $\tfrac34$.

Both examples point out different possible reasons for failure of stabilizing.
Example 18 works since the measures $\mu$ and $\nu$ are asymptotically very similar and close
to deterministic. In contrast, in Example 19 stabilizing fails because of lack of
independence: The quantity $\nu(a|x)$ strongly depends on $x$. In particular, one can
note that even Markovian dependence may spoil the stabilization, since $\nu(a|x)$ only
depends on the last symbol of $x$.
7 Stabilization

In the light of the previous section, it is therefore natural to ask when the maximizing
element stabilizes (almost surely). Barron [Bar85, BRY98] has shown that this
happens if all distributions in $\mathcal{C}$ are asymptotically mutually singular. Under this
condition, the true distribution is even eventually identified almost surely.³

The condition of asymptotic mutual singularity holds in many important cases,
e.g. if the distributions are i.i.d. However, one cannot always build on it.⁴ Therefore,
in this section we give a different approach: In order to prevent stabilization,
it is necessary that the ratio of two predictive distributions oscillates around the
inverse ratio of the respective weights. Therefore, stabilization must occur almost
surely if the ratio of two predictive distributions converges almost surely but is not
concentrated in the limit. This is satisfied under appropriate conditions, as we will
prove. We start with a general theorem which allows us to conclude almost sure
stabilization in a countable model class, if for any pair of models we have almost sure
stabilization.

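Theorem 21 Let $\mathcal{C}$ be a countable class of (semi)measures containing the true
measure $\mu$. Assume that for each two $\nu_1, \nu_2 \in \mathcal{C}$ the maximizing element chosen from
$\{\nu_1, \nu_2\}$ eventually stabilizes almost surely. Then also the maximizing element
chosen from all of $\mathcal{C}$ stabilizes almost surely.

Proof. It is immediate that the maximizing element chosen from any finite subset
of $\mathcal{C}$ stabilizes almost surely. Now, for all $\nu \in \mathcal{C}$ and $c > 0$, define the set $A_\nu^c$ by

\[ A_\nu^c = \Big\{ x_{<\infty} : \exists\, t \ge 1 \text{ such that } \frac{\nu(x_{1:t})}{\mu(x_{1:t})} > c \Big\}. \]

Then we have

\[ \mu(A_\nu^c) = \sum_{x \in P_\nu^c} \mu(x) \le \frac1c \sum_{x \in P_\nu^c} \nu(x) \le \frac1c, \qquad P_\nu^c = \Big\{ x \in \mathcal{X}^* : \frac{\nu(x)}{\mu(x)} > c \wedge \frac{\nu(x_{1:s})}{\mu(x_{1:s})} \le c \ \forall s < \ell(x) \Big\}, \]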

3 In general, stabilization does not imply that the true distribution is identified. Consider for 
instance a model class containing two measures: the true measure is concentrated on 0°° and has 
prior weight |, the other one assigns probability v{x t — 1) — 2~* independently of the past x <t . 
Then the maximizing element will remain the incorrect distribution however with predictions 
rapidly converging to the truth. 

4 Here is a simple example: let the true measure be Bernoulli^ ) and another measure be 
a product of Bernoullis with parameter rapidly converging to ^. These distributions are not 
asymptotically mutually singular, nevertheless a.s. stabilization holds, as we will see. 
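since $\nu$ is a (semi)measure and the set $\big\{ x \in \mathcal{X}^* : \frac{\nu(x)}{\mu(x)} > c \wedge \frac{\nu(x_{1:s})}{\mu(x_{1:s})} \le c \ \forall s < \ell(x) \big\}$
is prefix-free. Let

\[ B_\nu = \Big\{ x_{<\infty} : \exists\, t \ge 1 \text{ such that } \frac{w_\nu\nu(x_{1:t})}{w_\mu\mu(x_{1:t})} > 1 \Big\} = A_\nu^{w_\mu/w_\nu}, \]

then $\mu(B_\nu) \le \frac{w_\nu}{w_\mu}$ holds. We arrange the (semi)measures $\nu \in \mathcal{C}$ in an order $\nu_1, \nu_2, \ldots$
such that the weights $w_{\nu_1}, w_{\nu_2}, \ldots$ are descending. For each $c > 1$, we can now find
an index $k$ and a set

\[ \mathcal{N}_c = \{\nu_i : i \ge k\} \quad\text{such that}\quad \sum_{\nu \in \mathcal{N}_c} w_\nu \le \frac{w_\mu}{c}. \]

Defining $B^c = \bigcup_{\nu \in \mathcal{N}_c} B_\nu$, we get

\[ \mu(B^c) \le \sum_{\nu \in \mathcal{N}_c} \frac{w_\nu}{w_\mu} \le \frac1c. \]

For all $x_{<\infty} \notin B_\nu$, $\nu$ can never be the maximizing element. Therefore, for all
$x_{<\infty} \notin B^c$, there are only finitely many $\nu \notin \mathcal{N}_c$ having the chance of becoming the
maximizing element at any time. By assumption, the maximizing element chosen
from the finite set $\mathcal{C} \setminus \mathcal{N}_c$ stabilizes a.s. Thus, we conclude almost sure stabilization
on the sequences in $\mathcal{X}^\infty \setminus B^c$. Since this holds for all $B^c$ and $\mu(\mathcal{X}^\infty \setminus B^c) \to 1$ as
$c \to \infty$, the maximizing element stabilizes with $\mu$-probability one. □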

For the rest of this section, we assume that the model class $\mathcal{C}$ contains only
proper measures. A measure $\mu$ is called factorizable if there are measures $\mu_i$ on $\mathcal{X}$
such that

\[ \mu(x) = \prod_{i=1}^{\ell(x)} \mu_i(x_i) \]

for all $x \in \mathcal{X}^*$. That is, the symbols of sequences $x_{<\infty}$ generated by $\mu$ are
independent. A factorizable measure $\mu = \prod_i \mu_i$ is called uniformly stochastic, if there is
some $\delta > 0$ such that at each time $i$ the probability of all symbols $a \in \mathcal{X}$ is either
0 or at least $\delta$. That is

\[ \mu_i(a) > 0 \ \Rightarrow\ \mu_i(a) \ge \delta \quad\text{for all } a \in \mathcal{X} \text{ and } i \ge 1. \tag{27} \]

In particular, all deterministic measures and all i.i.d. distributions are uniformly
stochastic. Another simple example of a uniformly stochastic measure is a
probability distribution which generates alternately random bits by fair coin flips and the
digits of the binary representation of $\pi = 3.1415\ldots$
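Condition (27) is easy to test on a finite prefix of the factors. The following is a small sketch with hypothetical names; each factor $\mu_i$ is represented as a dictionary from symbols to probabilities, and the fixed bit list is only a stand-in for a computable digit sequence. A finite check can of course refute uniform stochasticity for a given $\delta$ but never establish it for the whole infinite product.

```python
from fractions import Fraction


def is_uniformly_stochastic(factors, delta):
    """Check (27) on a finite list of factors: every probability is 0 or at least delta."""
    return all(p == 0 or p >= delta for mu_i in factors for p in mu_i.values())


def point(bit):
    """Deterministic factor that emits `bit` with probability one."""
    return {bit: Fraction(1), 1 - bit: Fraction(0)}


fair = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Alternate fair coin flips with a fixed bit sequence (stand-in for computable digits).
stand_in_bits = [1, 1, 0, 0, 1, 0]
alternating = []
for b in stand_in_bits:
    alternating += [fair, point(b)]

# Factors whose probability for the symbol 0 shrinks like 2^(-i).
drifting = [{1: 1 - Fraction(1, 2 ** i), 0: Fraction(1, 2 ** i)} for i in range(1, 11)]

print(is_uniformly_stochastic(alternating, Fraction(1, 2)))
print(is_uniformly_stochastic(drifting, Fraction(1, 100)))
```

The second family illustrates the failure mode: as $i$ grows, the positive probability of the symbol 0 eventually drops below any fixed $\delta$.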

Lemma 22 Let $\mu$, $\nu$, and $\tilde\nu$ be factorizable and uniformly stochastic measures,
where $\mu$ is the true distribution.

(i) The maximizing element chosen from $\mu$ and $\nu$ stabilizes almost surely.
(ii) If $\mu$ is not eventually always preferred over $\nu$ or $\tilde\nu$ (in which case the
maximizing element stabilizes a.s. by (i)), then the maximizing element chosen from $\nu$
and $\tilde\nu$ stabilizes almost surely.
Proof. We will show only (ii), as the proof of (i) is similar but simpler. So we assume
that both $\nu$ and $\tilde\nu$ remain competitive in the process of choosing the maximizing
element, and show that then the maximizing element chosen from $\nu$ and $\tilde\nu$ stabilizes
almost surely.

Let $\nu = \prod_i \nu_i$, $\tilde\nu = \prod_i \tilde\nu_i$, and $X_i = \frac{\nu_i(x_i)}{\tilde\nu_i(x_i)}$. The $X_i$ are independent random
variables depending on the event $x_{<\infty}$. Moreover, both fractions $\frac{\nu(x_{1:t})}{\mu(x_{1:t})}$ and $\frac{\tilde\nu(x_{1:t})}{\mu(x_{1:t})}$
are martingales (with respect to $\mu$) and thus converge almost surely for $t \to \infty$. We
are interested only in the events in

\[ A_\nu = \Big\{ x_{<\infty} \in \mathcal{X}^\infty : \frac{\nu(x_{1:t})}{\mu(x_{1:t})} \text{ converges to a value } > 0 \Big\}, \]

since otherwise $\nu$ eventually is no longer competitive. So we assume that $\mu(A_\nu) > 0$,
which implies $\mu(A_\nu) = 1$ by the Kolmogorov zero-one-law (see e.g. [CT88]).
Similarly, $\mu(A_{\tilde\nu}) = 1$ for the analogously defined set $A_{\tilde\nu}$. That is,

\[ \prod_{i=1}^{t} X_i = \frac{\nu(x_{1:t})}{\tilde\nu(x_{1:t})} = \frac{\nu(x_{1:t})}{\mu(x_{1:t})} \Big/ \frac{\tilde\nu(x_{1:t})}{\mu(x_{1:t})} \]

converges to a value $> 0$ almost surely, and in particular $0 < X_i < \infty$ a.s.

Now we will use the concentration function of a real valued random variable $U$,

\[ Q(U,\eta) = \sup_{u\in\mathbb{R}} \mu(u \le U \le u + \eta), \quad \eta \ge 0. \tag{28} \]

This quantity was introduced by Lévy, see e.g. [Pet95]. The concentration function
is non-decreasing in $\eta$. Moreover, when two independent random variables $U$ and $V$
are added, we have [Pet95, Lemma 1.11]

\[ Q(U + V,\eta) \le \min\{Q(U,\eta), Q(V,\eta)\} \quad \forall\, \eta \ge 0. \tag{29} \]
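For random variables with finite support, the concentration function (28) can be computed exactly, since the supremum is attained with the left end of the window placed on a support point. The sketch below assumes distributions are given as dictionaries from values to probabilities (an illustrative representation, with hypothetical function names) and checks (29) on one pair of independent variables.

```python
from fractions import Fraction
from itertools import product


def concentration(dist, eta):
    """Q(U, eta) for a finitely supported U given as {value: probability}."""
    return max(
        sum(p for x, p in dist.items() if u <= x <= u + eta)
        for u in dist  # it suffices to start the window at a support point
    )


def add_independent(d1, d2):
    """Distribution of U + V for independent U and V (discrete convolution)."""
    out = {}
    for (x, p), (y, q) in product(d1.items(), d2.items()):
        out[x + y] = out.get(x + y, 0) + p * q
    return out


U = {0: Fraction(1, 2), 1: Fraction(1, 2)}
V = {0: Fraction(1, 4), 3: Fraction(3, 4)}
S = add_independent(U, V)

etas = [0, Fraction(1, 2), 1, 2, 3, 4]
inequality_holds = all(
    concentration(S, eta) <= min(concentration(U, eta), concentration(V, eta))
    for eta in etas
)
print(concentration(S, 0), inequality_holds)
```

Adding an independent summand can only spread mass out, which is the property the remainder of the proof exploits.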
We first assume that the following set is unbounded:

\[ B = \Big\{ \sum_{i=1}^{n} \big(1 - Q(X_i,\eta)\big) : n \in \mathbb{N},\ \eta > 0 \Big\} \subset \mathbb{R}^+, \quad\text{that is} \tag{30} \]

\[ \sup(B) = +\infty. \tag{31} \]

We show that then $\prod_{i=1}^{\infty} X_i$ (which converges a.s.) is not concentrated in the limit.
That is, it converges to some given $c > 0$, in particular to $c = \frac{w_{\tilde\nu}}{w_\nu}$, with $\mu$-probability
zero. This shows that almost surely it does not oscillate around $\frac{w_{\tilde\nu}}{w_\nu}$.

Define independent random variables $Y_i = \ln(X_i)$. Let $S_n := \sum_{i=1}^{n} Y_i$ and denote
its almost everywhere existing limit by $S = \sum_{i=1}^{\infty} Y_i$. The assertion is verified under
condition (31), if we can show that the distribution of $S$ is not concentrated to any
point, since then also $\prod_{i=1}^{\infty} X_i = \exp(S)$ is not concentrated to any point. In terms
of the concentration function defined in (28), this reads $Q(S, 0) = 0$. According to
(31), for each $R > 0$, we find $\eta > 0$ and $n \in \mathbb{N}$ such that $\sum_{i=1}^{n} \big(1 - Q(X_i,\eta)\big) > R$.
Then, because of $X_i < \infty$ (ignoring the measure-zero set where this may fail),

\[ W = \max X_i = \max\Big\{ \frac{\nu_i(x_i)}{\tilde\nu_i(x_i)} : 1 \le i \le n \text{ and } \tilde\nu_i(x_i) > 0 \Big\} \]

is finite. The mapping

\[ (0, W] \ni w \mapsto \ln(w) \in (-\infty, \ln W] \]

is bijective and has derivative at least $W^{-1}$. Let $\tilde\eta = \eta/W$. Then by definition of $Y_i$,
we have $Q(Y_i,\tilde\eta) \le Q(X_i,\eta)$ for $1 \le i \le n$ and consequently

\[ \sum_{i=1}^{n} \big(1 - Q(Y_i,\tilde\eta)\big) > R. \]

By the Kolmogorov-Rogozin inequality (see [Pet95, Theorem 2.15]), there is a
constant $C$ such that

\[ Q(S_n,\tilde\eta) \le C \Big( \sum_{i=1}^{n} \big(1 - Q(Y_i,\tilde\eta)\big) \Big)^{-1/2}. \]

Thus, for each $\varepsilon > 0$, we can choose $R$ sufficiently large to guarantee $C \cdot R^{-1/2} < \varepsilon$.
Then $Q(S_n,\tilde\eta) < \varepsilon$ for $n$ and $\tilde\eta$ as before. By (29) we conclude

\[ Q(S,\tilde\eta) = Q\Big(S_n + \Big(\sum_{i=n+1}^{\infty} Y_i\Big), \tilde\eta\Big) \le Q(S_n,\tilde\eta) < \varepsilon \]

and consequently $Q(S, 0) = 0$ since $Q$ is non-decreasing. This proves the assertion
under assumption (31).

Now assume that $B$ is bounded, i.e. (31) does not hold. Then there is $R > 0$
such that $\sum_{i=1}^{n} \big(1 - Q(X_i,\eta)\big) < R$ for all $\eta > 0$ and $n \in \mathbb{N}$. Since the distribution
of $X_i$ is a finite convex combination of point measures, for each $i$ there is an $\eta > 0$
such that $Q(X_i,\eta) = Q(X_i, 0)$ and thus $\sum_{i=1}^{n} \big(1 - Q(X_i, 0)\big) \le R$ for all $n \in \mathbb{N}$.
Therefore, also $\sum_{i=1}^{\infty} \big(1 - Q(X_i, 0)\big) \le R$ holds. Since $\nu_i(x_i) = c_i\tilde\nu_i(x_i)$ is equivalent
to $X_i = c_i$, this implies that there are constants $c_i > 0$ such that

\[ \sum_{i=1}^{\infty} \mu_i\{a : \nu_i(a) \ne c_i\tilde\nu_i(a)\} \le R. \tag{32} \]

Next we argue that if $c_i \ne 1$ for infinitely many $i$, then either $\nu$ or $\tilde\nu$ is eventually
not competitive. To verify this claim, let $N_i = \{a : \nu_i(a) \ne c_i\tilde\nu_i(a)\}$ and $M_i = \mathcal{X} \setminus N_i$
and observe that $\mu_i(N_i) < \delta$ holds for sufficiently large $i$, since the sum (32) is
bounded. On the other hand $\mu$ is uniformly stochastic, so there are no events of
probability $\mu_i(a) \in (0, \delta)$, hence $\mu_i(N_i) = 0$ and $\mu_i(M_i) = 1$ for sufficiently large
$i$. Now for these $i$, $c_i > 1$ together with $\tilde\nu_i(M_i) = 1$ implies the contradiction
$\nu_i(M_i) = c_i > 1$. So $c_i > 1$ necessarily requires $\tilde\nu_i(M_i) < 1$, hence $\tilde\nu_i(M_i) \le 1 - \delta$,
since $\tilde\nu$ is uniformly stochastic. If this happens infinitely often, then $\tilde\nu$ is eventually
not competitive. A symmetric argument with $\nu$ holds for $c_i < 1$.

The last paragraph shows that, if both $\nu$ and $\tilde\nu$ stay competitive, eventually
$\nu_i = \tilde\nu_i$ holds a.s. In this case, $\frac{\nu(x_{1:t})}{\tilde\nu(x_{1:t})}$ is eventually constant, which completes the
proof. □

Corollary 23 Let $\mathcal{C}$ be a countable class of factorizable and uniformly stochastic
measures, then the maximizing element stabilizes almost surely.

Proof. This follows from Theorem 21 and Lemma 22. □

Lemma 22 and Corollary 23 are certainly not the only or the strongest assertions
obtainable for stabilization. They rather give a flavor of what a proof can look like, even
if the distributions are not asymptotically mutually singular. On the other hand,
the given result is optimal at least in some sense, as shown by the previous Examples
18 and 19. In the former example, $\mu$ is not uniformly stochastic but both $\mu$ and
$\nu$ are factorizable, while in the latter one, $\mu$ is uniformly stochastic but $\nu$ is not
factorizable.

The proof of Lemma 22 crucially relies on the independence assumption, which
is necessary in order to use the Kolmogorov-Rogozin inequality. It is possible to
relax this and require independent sampling only "every so often". It is however not
clear how to remove this condition completely.

8 Applications 

In the following, we present some applications of the theory developed so far. We 
begin by stating general loss bounds. After that, three very general applications are 
discussed. 

8.1 Loss bounds 

So far we have only considered special loss functions, like the square loss, the
Hellinger loss, or the relative entropy. We now show how these results, in
particular the bounds for the Hellinger loss, imply regret bounds for arbitrary loss
functions. (As we will see, square distance is not sufficient.) This parallels the
bounds in [Hut03a, Hut03b]. The proofs are simplified, in particular Lemma [^]
facilitates the analysis considerably. The reader should compare the results to the
bounds for "prediction with expert advice", e.g. [CB97, HP05].

In order to keep things simple, we restrict to binary alphabet X = {0, 1} in this 
section. Our results extend to general alphabet by the techniques used in |Hut03a| . 
Consider a binary predictor having access to a belief probability tp depending on the 
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current history, e.g. (p(xt = l\x <t ) = §■ Which actual prediction should he output, 
0 or 1? We can answer this question if we know the loss function, according to which
losses are assigned to the (wrong) predictions. Consider for example the 0/1 loss
(also known as classification error loss), i.e. a wrong prediction gives loss of 1 and
a right prediction gives no loss. Then we should predict 1 if our belief is $\varphi > \tfrac12$.
This may be different under other loss functions. In general, we should predict in a
Bayes optimal way: We should output the symbol with the least expected loss,

\[ x_t^{\varphi} := \arg\min_{\hat x \in \{0,1\}} \big\{ (1 - \varphi)\,\ell(0,\hat x) + \varphi\,\ell(1,\hat x) \big\}, \]

where $\ell(x,\hat x)$ is the loss incurred by prediction $\hat x$ if the true symbol is $x$. In the
following, we will restrict to bounded loss functions $\ell(x,\hat x) \in [0, 1]$. Breaking ties in
the above expression in an arbitrary deterministic way, the resulting prediction is
deterministic for given $\varphi$ and loss function $\ell$. If $\mu$ is the true distribution as usual,
then let $l_t^{\varphi} := \sum_a \mu(a|x_{<t})\,\ell(a, x_t^{\varphi})$ be the $\mu$-expected loss of the $\varphi$-predictor. Then,
by

\[ L_{1:n}^{\varphi} = \mathbf{E}\big[l_1^{\varphi} + \ldots + l_n^{\varphi}\big] = \sum_{t=1}^{n} \sum_{x_{<t}} \mu(x_{<t})\, l_t^{\varphi}(x_{<t}) \]

we denote the cumulative $\mu$-expected loss of the $\varphi$-predictor. With $\varphi$ being the
variants of the MDL predictor, we will bound the quantity $\Delta_{1:n} = L_{1:n}^{\varphi} - L_{1:n}^{\mu}$, i.e.
the cumulative regret, by an expression depending on $L_{1:n}^{\mu}$ and $w_\mu^{-1}$.
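The Bayes optimal decision rule for the binary alphabet can be written in a few lines. In this sketch the loss function is a nested list `loss[x][x_hat]` (true symbol first, prediction second), ties are broken in favor of the symbol 0, and all names are illustrative assumptions rather than notation from the text.

```python
def bayes_optimal_prediction(phi, loss):
    """Symbol with least expected loss, given belief phi that the next symbol is 1."""
    expected = [(1 - phi) * loss[0][x_hat] + phi * loss[1][x_hat] for x_hat in (0, 1)]
    return 0 if expected[0] <= expected[1] else 1  # deterministic tie-breaking


def expected_loss(mu, prediction, loss):
    """Expected loss of a fixed prediction when the true probability of 1 is mu."""
    return (1 - mu) * loss[0][prediction] + mu * loss[1][prediction]


def instantaneous_regret(mu, phi, loss):
    """Expected loss of the phi-predictor minus that of the mu-predictor."""
    return (expected_loss(mu, bayes_optimal_prediction(phi, loss), loss)
            - expected_loss(mu, bayes_optimal_prediction(mu, loss), loss))


zero_one = [[0.0, 1.0], [1.0, 0.0]]    # loss[x][x_hat] = 1 iff the prediction is wrong
asymmetric = [[0.0, 1.0], [0.2, 0.0]]  # missing a 1 is cheap, a false alarm is expensive

print(bayes_optimal_prediction(0.7, zero_one))
print(bayes_optimal_prediction(0.7, asymmetric))
print(instantaneous_regret(0.4, 0.7, zero_one))
```

The second call shows that the same belief can lead to a different prediction once the loss function is asymmetric, which is why the regret analysis has to consider the worst possible loss.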

We admit arbitrary non-stationary loss functions $\ell_{x_{<t}}$ which may depend on the
history. Our analysis considers the worst possible choice of loss functions and consists
of three steps. First the cumulative regret bound is reduced to an instantaneous
regret bound (Lemma 24). Then the instantaneous bound is reduced to a bound in
terms of special functions of $\mu$ and $\varphi$ (Lemma 25). Finally, the bound for the special
functions is given (Lemma 26).

Lemma 24 Assume that some $\varphi$-predictor satisfies the instantaneous regret bound
$\delta_t = l_t^{\varphi} - l_t^{\mu} \le 2h_t + 2\sqrt{2h_t l_t^{\mu}}$, where $h_t = h_t(\mu,\varphi)$ is the Hellinger distance of the
instantaneous predictive probabilities (23). Then the cumulative $\varphi$-regret is bounded
in the same way:

\[ \Delta_{1:n} = L_{1:n}^{\varphi} - L_{1:n}^{\mu} \le 2H_{1:n}(\mu,\varphi) + 2\sqrt{2H_{1:n}(\mu,\varphi)\,L_{1:n}^{\mu}}. \]

This and the following lemma hold with arbitrary constants, the choices 2 and
$2\sqrt2$ are the smallest ones for which Lemma 26 is true. Note that if the Hellinger
distance is replaced by the relative entropy, then $2\sqrt2$ may be replaced by 2. Thus,
normalized dynamic MDL and Bayes mixture admit smaller bounds, compare [Hut03a].
However, this is not true for the other MDL variants, as we have no relative entropy
bound there.
Proof. The key property is the super-additivity of the bound. A function $f :
[0,\infty)^2 \to [0,\infty)$ is said to be super-additive if

\[ f(x_1 + x_2, y_1 + y_2) \ge f(x_1, y_1) + f(x_2, y_2). \]

The function $(H, L) \mapsto \sqrt{HL}$ satisfies this condition. We now use an inductive
argument. Assume $\Delta^0_{2:n} \le 2H^0_{2:n} + 2\sqrt{2H^0_{2:n} L^{\mu 0}_{2:n}}$, where the summation starts at $t =
2$ and the superscript indicates that the first symbol of the sequence was 0. Let the
same hold for the first symbol 1. Writing $\mu_1 = \mu(1|\epsilon)$ and using $\delta_1 \le 2h_1 + 2\sqrt{2h_1 l_1^{\mu}}$,
we obtain

\[
\begin{aligned}
\Delta_{1:n} &= \delta_1 + (1 - \mu_1)\Delta^0_{2:n} + \mu_1\Delta^1_{2:n}\\
&\le 2\Big[ h_1 + \sqrt{2h_1 l_1^{\mu}} + (1 - \mu_1)\Big(H^0_{2:n} + \sqrt{2H^0_{2:n}L^{\mu 0}_{2:n}}\Big) + \mu_1\Big(H^1_{2:n} + \sqrt{2H^1_{2:n}L^{\mu 1}_{2:n}}\Big) \Big]\\
&\le 2\Big[ H_{1:n} + \sqrt{2h_1 l_1^{\mu}} + \sqrt{2\big((1 - \mu_1)H^0_{2:n} + \mu_1H^1_{2:n}\big)\big((1 - \mu_1)L^{\mu 0}_{2:n} + \mu_1L^{\mu 1}_{2:n}\big)} \Big]\\
&\le 2H_{1:n} + 2\sqrt{2H_{1:n}L_{1:n}^{\mu}}.
\end{aligned}
\]

Here, the first inequality is the induction hypothesis together with the instantaneous
bound, the second bound is Cauchy-Schwarz's inequality, and the last estimate is
the super-additivity. □



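Lemma 25 Assume that some $\varphi$-predictor satisfies $\delta \le 2h + 2\sqrt{2h\ell}$ for all $\mu, \varphi \in
[0, 1]$, with the Hellinger distance $h = h(\mu,\varphi)$ and the special functions $\delta(\mu,\varphi)$ and
$\ell(\mu,\varphi)$ defined in the following way, where we slightly abuse notation and abbreviate
$\mu = \mu(1|\ldots)$ and $\varphi = \varphi(1|\ldots)$:

\[
\delta = \frac{|\mu - \varphi|}{\max\{\varphi, 1 - \varphi\}}
\quad\text{and}\quad
\ell = \begin{cases}
\mu & \text{if } \mu \le \varphi < \tfrac12,\\
\mu(1 - \varphi)/\varphi & \text{if } \mu \le \varphi \wedge \tfrac12 \le \varphi,\\
1 - \mu & \text{if } \tfrac12 < \varphi \le \mu,\\
(1 - \mu)\varphi/(1 - \varphi) & \text{if } \varphi \le \mu \wedge \varphi \le \tfrac12.
\end{cases}
\]

Then for arbitrary bounded loss function $\ell : \{0, 1\}^2 \to [0, 1]$, we have

\[ \delta \le 2h + 2\sqrt{2h\,l^{\mu}}. \tag{33} \]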

Proof. First we show that we may assume $\ell(0, 0) = \ell(1, 1) = 0$, i.e. we do not
incur loss for correct predictions. To this end, consider the modified loss function
$\ell'(x,\hat x) = \ell(x,\hat x) - \ell(x, x)$ and assume w.l.o.g. $\ell'(x,\hat x) \in [0, 1]$. Then it is not hard to
see that the regrets under the original and the modified loss functions coincide, while
the expected loss of the $\mu$-predictor clearly decreases with the modified loss function.
Thus, (33) holds for $\ell$ if it holds for $\ell'$. Hence we may assume $\ell(0, 0) = \ell(1, 1) = 0$.
For each possible outcome $x \in \{0, 1\}$, we abbreviate $\ell^x = \ell(x, 1 - x)$.

Now assume w.l.o.g. $\mu \le \varphi$. In order to show the assertion, we need to consider
the cases in the definition of $\ell$ separately. We show this only for the first case, i.e.
$\mu \le \varphi < \tfrac12$. Then $l^{\mu} = \mu\ell^1$, $l^{\varphi} = (1 - \mu)\ell^0$. We assume that the $\mu$-predictor outputs
0 and the $\varphi$-predictor 1, otherwise they give the same prediction and the $\varphi$-predictor
has no regret at all. This condition is equivalent to $\ell^0 = \ell^1\frac{u}{1-u}$ for some $u \in [\mu,\varphi]$.
We consider the worst case by maximizing $l^{\varphi}$, i.e. choosing $u$ as large as possible.
For this $u = \varphi$, we obtain $\ell^0 = \ell^1\frac{\varphi}{1-\varphi}$ and

\[ \delta = \ell^1\Big[\frac{(1 - \mu)\varphi}{1 - \varphi} - \mu\Big] = \ell^1\,\delta(\mu,\varphi) \le \ell^1\Big[2h + 2\sqrt{2h\,\ell(\mu,\varphi)}\Big] \le 2h + 2\sqrt{2h\,\ell^1\mu} = 2h + 2\sqrt{2h\,l^{\mu}}, \]

showing (33) provided that $\mu \le \varphi < \tfrac12$. The other cases are shown similarly. □

Lemma 26 The bound $\delta \le 2h + 2\sqrt{2h\ell}$ holds for all $\mu, \varphi \in [0, 1]$, with the functions
$\delta, \ell : [0, 1]^2 \to [0, 1]$ as defined in Lemma 25.

The technical and not very interesting proof of this lemma is omitted. The
careful reader may check the assertion numerically or graphically, as it is just the
boundedness of some function on the unit square. We remark that the bound does
not hold if the Hellinger distance is replaced by the quadratic distance, not even
with larger constants.
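A numerical check of the kind suggested above can be sketched as follows. The sketch assumes the case distinction for the special functions as it is derived in the proof of Lemma 25 (regret $|\mu - \varphi| / \max\{\varphi, 1 - \varphi\}$ and the four-case loss function), uses the binary Hellinger distance from (23), and scans a regular grid; the function names, grid resolution and tolerance are arbitrary illustrative choices, and a grid scan is evidence rather than a proof.

```python
import math


def hellinger(mu, phi):
    """Hellinger distance (23) between Bernoulli(mu) and Bernoulli(phi)."""
    return ((math.sqrt(mu) - math.sqrt(phi)) ** 2
            + (math.sqrt(1 - mu) - math.sqrt(1 - phi)) ** 2)


def special_delta(mu, phi):
    """Worst-case instantaneous regret, assumed form from Lemma 25."""
    return abs(mu - phi) / max(phi, 1 - phi)


def special_ell(mu, phi):
    """Worst-case loss of the mu-predictor, assumed form from Lemma 25."""
    if mu <= phi:
        return mu if phi < 0.5 else mu * (1 - phi) / phi
    return 1 - mu if phi > 0.5 else (1 - mu) * phi / (1 - phi)


def worst_violation(steps=200):
    """Largest value of delta - (2h + 2*sqrt(2*h*ell)) on a (steps+1)^2 grid."""
    worst = -math.inf
    for i in range(steps + 1):
        for j in range(steps + 1):
            mu, phi = i / steps, j / steps
            h = hellinger(mu, phi)
            bound = 2 * h + 2 * math.sqrt(2 * h * special_ell(mu, phi))
            worst = max(worst, special_delta(mu, phi) - bound)
    return worst


print(worst_violation())
```

On such a grid the slack becomes very small near $\mu \approx \varphi \approx \tfrac12$, in line with the remark that the constants 2 and $2\sqrt2$ cannot be lowered.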

Theorem 27 For arbitrary non-stationary loss function which is bounded in $[0, 1]$
and known to the MDL predictors, their respective losses are bounded by

\[ L_{1:n}^{\varrho_{\mathrm{norm}}},\ L_{1:n}^{\varrho},\ L_{1:n}^{\varrho^{\mathrm{static}}},\ L_{1:n}^{\varrho^{\mathrm{static}}_{\mathrm{norm}}} \ \le\ L_{1:n}^{\mu} + 2\sqrt{2c\,L_{1:n}^{\mu}\,w_\mu^{-1}} + 2c\,w_\mu^{-1}, \]

where the constant $c = 2, 8, 21$, or 32, according to which MDL predictor is used
(compare Corollary 14).

Proof. This follows from the above three lemmas and from $H_{1:n} \le c \cdot w_\mu^{-1}$
(Corollary 14). □

This shows in particular that, regardless of the loss function, the average expected
per-round regret tends to zero. Again, the direct practical relevance of the bounds
is limited because of the potentially huge $w_\mu^{-1}$.
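To get a feeling for the size of the bound, it can be evaluated for concrete numbers. The sketch below plugs $H_{1:n} \le c\,w_\mu^{-1}$ into the cumulative regret bound of Lemma 24 and uses the worst case $L_{1:n}^{\mu} = n$ for a loss bounded by 1; the function names and the example weight $2^{-10}$ are illustrative assumptions.

```python
import math


def regret_bound(loss_mu, w_mu, c):
    """Upper bound on the cumulative regret: 2*sqrt(2*c*L/w) + 2*c/w."""
    inv_w = 1.0 / w_mu
    return 2.0 * math.sqrt(2.0 * c * loss_mu * inv_w) + 2.0 * c * inv_w


def per_round_regret_bound(n, w_mu, c):
    """Average bound per round in the worst case where the mu-predictor loses 1 each round."""
    return regret_bound(float(n), w_mu, c) / n


w_mu = 2.0 ** -10  # hypothetical prior weight of the true distribution
c = 8              # constant for dynamic MDL

horizons = [10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8]
rates = [per_round_regret_bound(n, w_mu, c) for n in horizons]
for n, rate in zip(horizons, rates):
    print(n, rate)
```

The per-round bound decays like $n^{-1/2}$, but only becomes small once $n$ is large compared to $c\,w_\mu^{-1}$, which illustrates the limited practical relevance noted above.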

8.2 Classification 

Transferring our results to pattern classification is very easy. All we have to do is
to add inputs to our models. That is, we consider an arbitrary input space $\mathcal{U}$ and
(as before) a finite observation or output space $\mathcal{X}$. A model is now a measure

\[ \nu(x|u) \in [0, 1],\quad x \in \mathcal{X},\ u \in \mathcal{U},\quad\text{where } \sum_{x\in\mathcal{X}} \nu(x|u) = 1 \text{ for all } u \in \mathcal{U}. \]

That is, we have a distribution which is conditionalized to the input. We restrict
our discussion to measures, since there is no motivation to consider semimeasures
for classification. The definition of a model does not include history dependence.
There is no loss of generality: We may include the history in the arbitrary input
space.

Transferring the proofs in the previous sections to the present setup is straight- 
forward. We therefore obtain immediately the following corollaries. 

Corollary 28 Let $\mathcal{C}$ be a countable set of classification models containing the true
distribution $\mu$. Then for any sequence of inputs $u_{<\infty} \in \mathcal{U}^\infty$, we have

\[
\begin{aligned}
\sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\mu(a|u_t) - \varrho_{\mathrm{norm}}(a|u_t, u_{<t}, x_{<t})\big)^2 &\le 2w_\mu^{-1},\\
\sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\mu(a|u_t) - \varrho(a|u_t, u_{<t}, x_{<t})\big)^2 &\le 8w_\mu^{-1},\\
\sum_{t=1}^{\infty} \mathbf{E} \sum_{a\in\mathcal{X}} \big(\mu(a|u_t) - \varrho^{\mathrm{static}}(a|u_t, u_{<t}, x_{<t})\big)^2 &\le 21w_\mu^{-1}.
\end{aligned}
\]

(Note that although each single model formally does not depend on the history, the
MDL estimators necessarily do.)

We need not consider the normalized static variant here, since all models are
measures anyway. If there is a distribution over $\mathcal{U}$, the result therefore also holds in
expectation over the inputs. An analogue of Corollary 23 is obtained as easily. If the
inputs are i.i.d., which is usually assumed for classification, then the two conditions
of factorizability and uniform stochasticity are trivially satisfied. Therefore, the
true distribution $\mu$ is eventually discovered by MDL almost surely. Note that in
this case, the distributions are also asymptotically mutually singular, so that the
assertion also follows from Barron's [Bar85] earlier result.

Note that again, the assumption $\mu \in \mathcal{C}$ is essential. In practical applications, if
this is not clear, it may therefore be favorable to choose a different method having
guarantees without this condition, compare [GL04].

8.3 Regression

We may also apply our results in the regression setup, that is for predicting
continuous densities. Our use of the term regression is a bit non-standard here, since it
normally refers to just estimating the mean of some prediction, where the
distribution is often assumed to be Gaussian. Again the assumption $\mu \in \mathcal{C}$ is essential, so
that in practice some other method not relying on it might be preferred.

Continuous densities cause some additional difficulties. The observation space is
now $\mathbb{R}$. This implies in particular that, like for the loss bounds, the square distance
is no longer appropriate for our purpose⁵ (note that our use of the squared error
is completely different from the standard use in regression). So we will use the
Hellinger distance instead, defined similarly to (23) by

\[ h(f,\tilde f) = \int \Big(\sqrt{f(x)} - \sqrt{\tilde f(x)}\Big)^2 dx \quad\text{for integrable } f, \tilde f : \mathbb{R} \to [0,\infty). \tag{34} \]

5 To see this, define a distribution / by its density f n — ^X[-i,o] + lfX(o,-]> wnere X 1S
the characteristic function of an interval. Let f(x) = f(—x), then the quadratic distance is
f(f — f) 2 dx — 2p n _zl2? whereas the relative entropy f f \n{f / f)dx = ^ is constant.

Accordingly, $H_{1:n}(\mu,\varphi) = \sum_{t=1}^{n} \mathbf{E}\, h\big(\mu(\cdot|u_t), \varphi(\cdot|u_t, u_{<t}, x_{<t})\big)$ is the cumulative
Hellinger distance of two predictive distributions $\mu$ and $\varphi$. Similarly as in (25)
and (26), the Hellinger distance is bounded by the (continuous) relative entropy
and absolute distance. This shows in particular that the integral (34) exists.

We now consider a countable class $\mathcal{C}$ of models that are functions $\nu$ from $\mathcal{U}$ to
uniformly bounded probability densities on $\mathcal{X} = \mathbb{R}$. That is, there is some $C > 0$
such that

\[ 0 \le \nu_i(x|u) \le C \quad\text{and}\quad \int_{-\infty}^{\infty} \nu_i(x|u)\,dx = 1 \tag{35} \]

for all $i \ge 1$, $u \in \mathcal{U}$, and $x \in \mathbb{R}$. The MDL estimator is then defined as the
element which maximizes the density, $\nu^{*} = \arg\max_{\nu\in\mathcal{C}}\{w_\nu\nu(x_{1:n}|u_{1:n})\}$. The uniform
boundedness condition asserts that the MDL estimator exists. It may be relaxed,
provided that the MDL estimator remains well-defined, such as for a family of
Gaussian densities which tend to the point measure.

With these definitions, the proofs of the theorems for static and dynamic MDL
can be adapted. Since the triangle inequality holds for the square root of the Hellinger distance,
we obtain the following.

Corollary 29 Let $\mathcal{C}$ be a countable model class according to (35) containing
the true distribution $\mu$. Then for any sequence of inputs $u_{<\infty} \in \mathcal{U}^\infty$, we have
$H_{1:n}(\mu,\varrho_{\mathrm{norm}}) \le 2w_\mu^{-1}$, $H_{1:n}(\mu,\varrho) \le 8w_\mu^{-1}$, and $H_{1:n}(\mu,\varrho^{\mathrm{static}}) \le 21w_\mu^{-1}$.

We may apply this for example to model classes with Gaussian noise, concluding
that the mean and the variances converge to the true values, see [PH05] for an
example. It is not immediately clear how to obtain an analogue of Corollary 23 for
continuous densities.

8.4 Universal Induction 

Since the assertions on static and dynamic MDL have been proven generally for
semimeasures, we may apply them to the universal setup. Here
$\mathcal{C} = \mathcal{M}$ is the countable set of all lower semicomputable
(= enumerable) semimeasures on $\mathcal{X}^*$. So $\mathcal{M}$ contains
stochastic models in general, and in particular all models for computable
deterministic sequences. There is a one-to-one correspondence of $\mathcal{M}$
to the class of all programs on some fixed universal monotone Turing machine
$U$, see e.g. [LV97]. We will assume programs to be binary, in contrast to
outputs, which are strings $x \in \mathcal{X}^*$. This relation defines in
particular the complexities and weights of each $\nu$ by

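$$Kw(\nu) = \text{length of the program for } \nu \text{ on } U, \quad\text{and}\quad w_\nu = 2^{-Kw(\nu)}. \qquad (36)$$

We call these weights the canonical weights. They satisfy $w_\nu > 0$ for all
$\nu$ and $\sum_{\nu \in \mathcal{M}} w_\nu \leq 1$.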

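An enumerable semimeasure which dominates all other enumerable semimeasures
is called universal. The Bayes mixture $\xi$ defined in (2) has this property.
One can show that $\xi$ is equal within a multiplicative constant to
Solomonoff's prior [Sol64, Eq. (7)], which is the a priori probability that
(some extension of) a string $x$ is generated provided that the input of $U$
consists of fair coin flips. That is,

$$\xi(x) \stackrel{\times}{=} M(x) = \sum_{p \text{ minimal}:\, U(p)=x*} 2^{-\ell(p)} \quad\text{for all } x \in \mathcal{X}^*.$$

Here, we use the notations

$$f \stackrel{+}{\leq} g :\Leftrightarrow f \leq g + O(1), \qquad f \stackrel{+}{=} g :\Leftrightarrow f \stackrel{+}{\leq} g \wedge g \stackrel{+}{\leq} f,$$
$$f \stackrel{\times}{\leq} g :\Leftrightarrow f \leq g \cdot O(1), \qquad f \stackrel{\times}{=} g :\Leftrightarrow f \stackrel{\times}{\leq} g \wedge g \stackrel{\times}{\leq} f.$$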

The MDL definitions in Section El directly transfer to this setup. All bounds on 
the cumulative square loss (subsumed in Corollary 14) therefore apply to
$\varrho = \varrho_{[\mathcal{M}]}$. The necessary assumption now reads that
$\mu$ must be a recursive (= computable) measure. Also, Theorem implies
Solomonoff's important universal induction theorem.

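In addition to $\mathcal{M}$, we also consider the set of all recursive measures
$\tilde{\mathcal{M}}$ together with the same canonical weights (36). We define
$\tilde\xi = \xi_{[\tilde{\mathcal{M}}]}$ and
$\tilde\varrho = \varrho_{[\tilde{\mathcal{M}}]}$. Then
$\tilde\varrho(x) \leq \tilde\xi(x) \leq \xi(x)$ and $\varrho(x) \leq \xi(x)$ for
all $x \in \mathcal{X}^*$ is immediate. It is straightforward that
$\xi(x) \stackrel{\times}{\leq} \varrho(x)$ since $\xi \in \mathcal{M}$. Moreover,
for any string $x \in \mathcal{X}^*$, define the monotone complexity
$Km(x) = \min\{\ell(p) : U(p) = x*\}$ as the length of the shortest program
such that $U$'s output starts with $x$. The following assertion holds.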

Proposition 30 We have $K\tilde\varrho \stackrel{+}{\geq} Km$.

Proof. We must show that given a string $x \in \mathcal{X}^*$ and a recursive
measure $\nu$ (which in particular may be the MDL descriptor $\nu^*(x)$) it is
possible to specify a program $p$ of length at most $Kw(\nu) + K\nu(x) + c$
that outputs a string starting with $x$, where the constant $c$ is independent
of $x$ and $\nu$.

Consider all strings $y_i \in \mathcal{X}^n$ ($1 \leq i \leq |\mathcal{X}|^n$) of
length $n = \ell(x)$ arranged in lexicographical order. Each $y_i$ has measure
$p_i = \nu(y_i)$. Let $s_i$ be the cumulated measures: $s_0 = 0$ and
$s_i = \sum_{k=1}^{i} p_k$. Let $j$ be the index of $x$, i.e. $x = y_j$. Then
the interval $[s_{j-1}, s_j) \subset [0,1)$ has measure $p_j$ and therefore
contains at least one $\lceil -\log_2 p_j \rceil$-bit number
$z \in [s_{j-1}, s_j)$. We describe $x$ with the number $z$; this is known as
arithmetic encoding (see e.g. [CT91]). The coding is injective since
$[s_{i-1}, s_i)$ and $[s_{k-1}, s_k)$ are disjoint for $i \neq k$.

In order to decode $z$, we may descend the $|\mathcal{X}|$-ary tree of all
possible strings $y$, first considering strings of length one, then of length
two, etc. For each possible string $y$, we can determine its binary code by
approximating $\nu(x)$ sufficiently accurately. Eventually we will find $z$,
then we print the current $y$. At this stage, $y$ might be only a prefix of $x$,
since an extension of $y$ might have a measure very close to that of $y$ and
thus map to the same code $z$. Therefore we continue the procedure until all
codes starting with $z$ are proper extensions of $z$ (this may never be the
case, and then the algorithm runs forever). In each step, the appropriate
additional symbol is written on the output tape. The resulting output will be
$x$ or some extension of $x$.

This algorithm can be specified in a constant number $c'$ of bits. The
description of $\nu$ needs another $Kw(\nu)$ bits. Finally, $z$ has length
$\lceil -\log_2 p_j \rceil \leq -\log_2 \nu(x) + 1$. Thus, the overall
description has length at most $Kw(\nu) + K\nu(x) + c$ as required. $\Box$
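
The encoding step of the proof can be sketched in a few lines of Python. The
fragment below enumerates all strings of the same length as $x$ in
lexicographical order, accumulates their measures, and picks a number with
$\lceil -\log_2 \nu(x) \rceil$ bits inside the interval belonging to $x$. The
Bernoulli measure, the binary alphabet, and all names are assumptions made for
illustration only. The brute-force enumeration is exponential in the string
length, and the decoding procedure is not implemented.

```python
from fractions import Fraction
from itertools import product
from math import ceil

ALPHABET = "01"          # hypothetical alphabet, listed in lexicographic order
THETA = Fraction(3, 4)   # hypothetical probability of the symbol "1"

def nu(y):
    """Illustrative recursive measure: i.i.d. Bernoulli(THETA) on binary strings."""
    ones = y.count("1")
    return THETA ** ones * (1 - THETA) ** (len(y) - ones)

def encode(x):
    """Return (bits, z), where z is a number with ceil(-log2 nu(x)) bits
    lying in the interval [lower, upper) that belongs to x."""
    lower = Fraction(0)                    # cumulated measure of all strings before x
    for symbols in product(ALPHABET, repeat=len(x)):
        y = "".join(symbols)
        if y == x:
            break
        lower += nu(y)
    p = nu(x)                              # measure of x itself
    upper = lower + p
    length = 0                             # smallest L with 2^-L <= p
    while Fraction(1, 2 ** length) > p:
        length += 1
    k = ceil(lower * 2 ** length)          # smallest multiple of 2^-L that is >= lower
    z = Fraction(k, 2 ** length)
    assert lower <= z < upper              # guaranteed because the interval has length >= 2^-L
    bits = format(k, "b").zfill(length) if length > 0 else ""
    return bits, z

bits, z = encode("110")
print(bits, z)
```

Exact rational arithmetic via `Fraction` avoids rounding problems when testing
whether $z$ falls into the interval. Since the intervals of distinct strings
are disjoint, distinct strings of equal length receive distinct numbers $z$.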

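It is also possible to prove the proposition indirectly using [LV97, Thm. 4.5.4].
This implies that $Km(x) \stackrel{+}{\leq} Kw(\nu) + K\nu(x)$ for all
$x \in \mathcal{X}^*$ and all recursive measures $\nu \in \tilde{\mathcal{M}}$.
Then, also
$Km(x) \stackrel{+}{\leq} \min_{\nu \in \tilde{\mathcal{M}}}\{Kw(\nu) + K\nu(x)\} = K\tilde\varrho(x)$
holds. So together with the above observations, we have

$$Km(x) \stackrel{+}{\leq} K\tilde\varrho(x) \stackrel{+}{\geq} K\tilde\xi(x) \stackrel{+}{\geq} K\varrho(x) \stackrel{+}{=} KM(x). \qquad (37)$$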

On the other hand, there is a deep result in Algorithmic Information Theory which
states that an exact coding theorem does not hold on continuous sample space:
$Km(x) \stackrel{+}{=} KM(x)$ fails, since $Km(x) - KM(x)$ is unbounded [Gac83].
Therefore, at least one of the above $\geq$ must be proper.

Problem 31 Which of the two inequalities
$K\tilde\varrho(x) \stackrel{+}{\geq} K\tilde\xi(x)$ and
$K\tilde\xi(x) \stackrel{+}{\geq} K\varrho(x)$ is proper (or are both)?



The proof in [Gac83] is very subtle, and the phenomenon is still not completely
understood. There is some hope that by answering Problem 31 one arrives at a
better understanding of the continuous coding theorem and even at a simpler proof
for its failure.



9 Discussion 

In this last section, we recapitulate the main achievements of this work and discuss 
their philosophical and practical consequences. In the first place, we have shown 
that if two-part MDL is used for predicting a stochastic sequence, then the predictive 
probabilities converge to the true ones in mean sum, provided that the distribution 
generating the sequence is contained in the model class. The two most important 
implications are almost sure convergence and loss bounds for arbitrary loss functions. 
The guaranteed convergence is slow in general: All bounds depend linearly on
$w_\mu^{-1}$, the inverse of the prior weight of the true distribution. For large
model classes, this number must be regarded as too large to be relevant for
practical applications.
Examples show that this bound is sharp. This is in contrast to the exponentially 
smaller corresponding bound for the Bayes mixture. The latter predictor however is 
often computationally more expensive to approximate in practice. We believe that 
this principally indicates that with MDL, some care has to be taken when choosing 
the model class and the prior. Conditions which are sufficient for fast convergence 
have been given for instance in [Ris96, BRY98, PH04b]. It remains a major challenge
to generalize these results in order to obtain fast convergence under assumptions 
that are as weak as possible. In particular for universal induction, this question is 
interesting and possibly difficult. Even when considering only computable Bernoulli 
distributions endowed with a universal prior, fast convergence possibly holds for 
many environments, but maybe not for all [PH04b]. We also need to distinguish
how the large error accumulates. Either the instantaneous error remains
significant for a long time, which is critical, or the instantaneous error drops
just too slowly to be summable, e.g. as $O(\frac{1}{t})$, which is tolerable. We
have seen instances for both cases; compare the discussion after Example. In
this light, the cumulative error might not be the right quantity to assess
convergence speed.
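
To give a feeling for the gap between a linear and a logarithmic dependence on
$w_\mu^{-1}$, the following sketch evaluates the three constants of Corollary 29
(2, 8, and 21 times $w_\mu^{-1}$) under canonical weights $w_\mu = 2^{-Kw(\mu)}$
and contrasts them with $\ln w_\mu^{-1}$, the well-known bound for the Bayes
mixture, which is stated here as background rather than derived in this section.
The function name and the chosen complexities are hypothetical; this is plain
arithmetic, not a new result.

```python
import math

def bounds(kw_bits):
    """Cumulative loss bounds for a true distribution whose program has kw_bits bits."""
    inverse_weight = 2.0 ** kw_bits        # w_mu^{-1} under canonical weights
    return {
        "dynamic MDL, normalized": 2 * inverse_weight,
        "dynamic MDL": 8 * inverse_weight,
        "static MDL": 21 * inverse_weight,
        "Bayes mixture": kw_bits * math.log(2.0),   # ln(w_mu^{-1})
    }

for kw_bits in (5, 10, 20, 40):
    row = bounds(kw_bits)
    print(f"Kw = {kw_bits:2d} bits:",
          ", ".join(f"{name} <= {value:.4g}" for name, value in row.items()))
```

Already for a true distribution of modest complexity the MDL bounds exceed any
realistic sample size, whereas the bound for the Bayes mixture grows only
linearly in the number of bits.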

The main results have been shown under the only assumption that the data 
generating process is contained in the model class. This condition is essential in 
general, as [GL04] shows that in its absence MDL can fail dramatically. In the
universal setup, the assumption merely requires that the data is generated in some 
(probabilistically) computable way. This is a very weak condition. Laplace, Zuse 
[Zus67] and successors argue that nature operates in a computable way, and
consequently all conceivable data satisfy the assumption. On the other hand, predicting
with a universal model is computationally very expensive. In particular it is prov- 
ably infeasible if the thesis of computable nature holds. Despite these practical 
problems, the theory of universal prediction is valuable since it explores the limits 
of computational induction. 